Abstract — In this work we address the subspace clustering 
problem. Given a set of data samples (vectors) approximately 
drawn from a union of multiple subspaces, our goal is to cluster 
the samples into their respective subspaces and remove possible 
outliers as well. To this end, we propose a novel objective function 
named Low-Rank Representation (LRR), which seeks the lowest- 
rank representation among all the candidates that can represent 
the data samples as linear combinations of the bases in a given 
dictionary. It is shown that the convex program associated with 
LRR solves the subspace clustering problem in the following 
sense: when the data is clean, we prove that LRR exactly recovers 
the true subspace structures; when the data is contaminated by 
outliers, we prove that under certain conditions LRR can exactly 
recover the row space of the original data and detect the outliers as 
well; for data corrupted by arbitrary sparse errors, LRR can also 
approximately recover the row space with theoretical guarantees. 
Since the subspace membership is provably determined by the 
row space, these further imply that LRR can perform robust 
subspace clustering and error correction, in an efficient and 
effective way. 

Index Terms — low-rank representation, subspace clustering, 
segmentation, outlier detection. 



I. Introduction 

In pattern analysis and signal processing, an underlying 
tenet is that the data often contains some type of structure 
that enables intelligent representation and processing. So one 
usually needs a parametric model to characterize a given set 
of data. To this end, the well-known (linear) subspaces are 
possibly the most common choice, mainly because they are 
easy to compute and often effective in real applications. Several 
types of visual data, such as motion [1], [2], [3], face [4] 
and texture [5], have been known to be well characterized by 
subspaces. Moreover, by applying the concept of reproducing 
kernel Hilbert space, one can easily extend the linear models 
to handle nonlinear data. So the subspace methods have been 
gaining much attention in recent years. For example, the 
widely used Principal Component Analysis (PCA) method and 
the recently established matrix completion [6] and recovery 
methods are essentially based on the hypothesis that the data 
is approximately drawn from a low-rank subspace. However, 
a given data set can seldom be well described by a single 



Fig. 1. A mixture of subspaces consisting of a 2D plane and two 1D 
lines. (a) The samples are strictly drawn from the underlying 
subspaces. (b) The samples are approximately drawn from the 
underlying subspaces. 



subspace. A more reasonable model is to consider data as 
lying near several subspaces, namely the data is considered 
as samples approximately drawn from a mixture of several 
low-rank subspaces, as shown in Fig. 1. 

The generality and importance of subspaces naturally lead 
to a challenging problem of subspace segmentation (or clus- 
tering), whose goal is to segment (cluster or group) data 
into clusters with each cluster corresponding to a subspace. 
Subspace segmentation is an important data clustering problem 
and arises in numerous research areas, including computer 
vision 0, 0, 0, image processing 0, ifTOll and system 
identification [11]. When the data is clean, i.e., the samples 
are strictly drawn from the subspaces, several existing methods 
(e.g., [12], [13], [14]) are able to exactly solve the subspace 
segmentation problem. So, as pointed out by 0, lfl4l . the 
main challenge of subspace segmentation is to handle the 
errors (e.g., noise and corruptions) that possibly exist in data, 
i.e., to handle the data that may not strictly follow subspace 
structures. With this viewpoint, in this paper we therefore 
study the following subspace clustering [15] problem. 

Problem 1.1 (Subspace Clustering): Given a set of data 
samples approximately (i.e., the data may contain errors) 
drawn from a union of linear subspaces, correct the possible 
errors and segment all samples into their respective subspaces 
simultaneously. 

[Figure 2 panels: (a) noise, (b) random corruptions, (c) sample-specific corruptions] 

Fig. 2. Illustrating three typical types of errors: (a) noise [6], 
which indicates the phenomenon that the data is slightly 
perturbed around the subspaces (what we show is a perturbed 
data matrix whose columns are samples drawn from the 
subspaces); (b) random corruptions [7], which indicate that a 
fraction of random entries are grossly corrupted; (c) 
sample-specific corruptions (and outliers), which indicate the 
phenomenon that a fraction of the data samples (i.e., columns 
of the data matrix) are far away from the subspaces. 

Notice that the word "error" generally refers to the deviation 
between model assumption (i.e., subspaces) and data. It could 
exhibit as noise 0, missed entries |6], outliers lITol and 
corruptions [7] in reality. Fig. 2 illustrates three typical types 
of errors under the context of subspace modeling. In this 
work, we shall focus on the sample-specific corruptions (and 
outliers) shown in Fig. 2(c), with mild concerns to the cases of 
Fig. 2(a) and Fig. 2(b). Notice that an outlier is from a different 
model other than subspaces, and is essentially different from a 
corrupted sample that belongs to the subspaces. We put them 
into the same category just because they can be handled in the 
same way, as will be shown in Section V-B. 

To recover the subspace structures from the data containing 
errors, we propose a novel method termed low-rank representation 
(LRR) [14]. Given a set of data samples each of which 
can be represented as a linear combination of the bases in a 
dictionary, LRR aims at finding the lowest-rank representation 
of all data jointly. The computational procedure of LRR is to 
solve a nuclear norm [17] regularized optimization problem, 
which is convex and can be solved in polynomial time. By 
choosing a specific dictionary, it is shown that LRR can 
solve the subspace clustering problem well: when the data is clean, 
we prove that LRR exactly recovers the row space of the data; 
for the data contaminated by outliers, we prove that under 
certain conditions LRR can exactly recover the row space of 
the original data and detect the outliers as well; for the data 
corrupted by arbitrary errors, LRR can also approximately 
recover the row space with theoretical guarantees. Since the 
subspace membership is provably determined by the row space 
(we will discuss this in Section III-B), these further imply 
that LRR can perform robust subspace clustering and error 
correction in an efficient way. In summary, the contributions 
of this work include: 

• We develop a simple yet effective method, termed LRR, 
which has been used to achieve state-of-the-art performance 
in several applications such as motion segmentation 
[14], image segmentation [18], saliency detection [19] 
and face recognition [4]. 

• Our work extends the recovery of corrupted data from a 
single subspace [7] to multiple subspaces. Compared to 
[20], which requires the bases of subspaces to be known 
for handling the corrupted data from multiple subspaces, 
our method is autonomous, i.e., no extra clean data is 
required. 

• Theoretical results for robust recovery are provided. 
While our analysis shares similar features with previous 
work in matrix completion [6] and robust PCA (RPCA) 
[7], [16], it is considerably more challenging due to the 
fact that there is a dictionary matrix in LRR. 

II. Related Work 

In this section, we discuss some existing subspace segmen- 
tation methods. In general, existing works can be roughly 
divided into four main categories: mixture of Gaussian, fac- 
torization, algebraic and spectral-type methods. 

In statistical learning, mixed data is typically modeled as a 
set of independent samples drawn from a mixture of proba- 
bilistic distributions. As a single subspace can be well modeled 
by a (degenerate) Gaussian distribution, it is straightforward 
to assume that each probabilistic distribution is Gaussian, i.e., 
adopting a mixture of Gaussian models. Then the problem 
of segmenting the data is converted to a model estimation 
problem. The estimation can be performed either by using the 
Expectation Maximization (EM) algorithm to find a maximum 
likelihood estimate, as done in Ell , or by iteratively finding a 
min-max estimate, as adopted by K-subspaces El and Random 
Sample Consensus (RANSAC) iHOl . These methods are sensi- 
tive to errors. So several efforts have been made for improving 
their robustness, e.g., the Median K-flats [22] for K-subspaces, 
the work E21 for RANSAC, and [5] use a coding length to 
characterize a mixture of Gaussian. These refinements may 
introduce some robustness. Nevertheless, the problem is still 
not well solved due to the optimization difficulty, which is a 
bottleneck for these methods. 

Factorization based methods [12] seek to approximate the 
given data matrix as a product of two matrices, such that the 
support pattern for one of the factors reveals the segmentation 
of the samples. In order to achieve robustness to noise, these 
methods modify the formulations by adding extra regular- 
ization terms. Nevertheless, such modifications usually lead 
to non-convex optimization problems, which need heuristic 
algorithms (often based on alternating minimization or EM- 
style algorithms) to solve. Getting stuck at local minima may 
undermine their performances, especially when the data is 
grossly corrupted. It will be shown that LRR can be regarded 
as a robust generalization of the method in [12] (which is 
referred to as PCA in this paper). The formulation of LRR is 
convex and can be solved in polynomial time. 

Generalized Principal Component Analysis (GPCA) [24] 
presents an algebraic way to model the data drawn from a 
union of multiple subspaces. This method describes a subspace 
containing a data point by using the gradient of a polynomial 
at that point. Then subspace segmentation is made equivalent 
to fitting the data with polynomials. GPCA can guarantee the 
success of the segmentation under certain conditions, and it 
does not impose any restriction on the subspaces. However, 
this method is sensitive to noise due to the difficulty of 
estimating the polynomials from real data, which also causes 
the high computation cost of GPCA. Recently, Robust Algebraic 
Segmentation (RAS) [25] has been proposed to resolve 
the robustness issue of GPCA. However, the computation 
difficulty for fitting polynomials is unfathomably large. So 
RAS can make sense only when the data dimension is low 
and the number of subspaces is small. 

As a data clustering problem, subspace segmentation can be 
done by firstly learning an affinity matrix from the given data, 
and then obtaining the final segmentation results by spectral 
clustering algorithms such as Normalized Cuts (NCut) [26]. 
Many existing methods such as Sparse Subspace Clustering 
(SSC) d, Spectral Curvature Clustering (SCC) l27l. l28l. 
Spectral Local Best-fit Flats (SLBF) EH, El, the proposed 
LRR method and 0, (3T), possess such spectral nature, so 
called as spectral-type methods. The main difference among 
various spectral-type methods is the approach for learning 
the affinity matrix. Under the assumption that the data is 
clean and the subspaces are independent, [13] shows that the 
solution produced by sparse representation (SR) [32] could 
achieve the so-called $\ell_1$ Subspace Detection Property 
($\ell_1$-SDP): the within-class affinities are sparse and the 
between-class affinities are all zeros. In the presence of outliers, it 
is shown in [15] that the SR method can still obey $\ell_1$-SDP. 
However, $\ell_1$-SDP may not be sufficient to ensure the 
success of subspace segmentation [33]. Recently, Lerman and 
Zhang [34] proved that under certain conditions the multiple 
subspace structures can be exactly recovered via $\ell_p$ ($p < 1$) 
minimization. Unfortunately, since the formulation is not 
convex, it is still unknown how to efficiently obtain the 
globally optimal solution. In contrast, the formulation of LRR 
is convex and the corresponding optimization problem can be 
solved in polynomial time. What is more, even if the data 
is contaminated by outliers, the proposed LRR method is 
proven to exactly recover the right row space, which provably 
determines the subspace segmentation results (we shall discuss 
this in Section III-B). In the presence of arbitrary errors (e.g., 
corruptions, outliers and noise), LRR is also guaranteed to 
produce near recovery. 

III. Preliminaries and Problem Statement 

A. Summary of Main Notations 

In this work, matrices are represented with capital symbols. 
In particular, $I$ is used to denote the identity matrix, and the 
entries of matrices are denoted by using $[\cdot]$ with subscripts. 
For instance, $M$ is a matrix, $[M]_{ij}$ is its $(i,j)$-th entry, 
$[M]_{i,:}$ is its $i$-th row, and $[M]_{:,j}$ is its $j$-th column. For 
ease of presentation, the horizontal (resp. vertical) concatenation 
of a collection of matrices along row (resp. column) 
is denoted by $[M_1, M_2, \cdots, M_k]$ (resp. $[M_1; M_2; \cdots; M_k]$). 
The block-diagonal matrix formed by a collection of matrices 
$M_1, M_2, \cdots, M_k$ is denoted by 

$$\mathrm{diag}(M_1, M_2, \cdots, M_k) = \begin{bmatrix} M_1 & 0 & \cdots & 0 \\ 0 & M_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_k \end{bmatrix}.$$ 

Fig. 3. An example of the matrix $V_0V_0^T$ computed from dependent 
subspaces. In this example, we create 11 pairwise disjoint 
subspaces each of which is of dimension 20, and draw 20 
samples from each subspace without errors. The ambient 
dimension is 200, which is smaller than the sum of the 
dimensions of the subspaces. So the subspaces are dependent 
and $V_0V_0^T$ is not strictly block-diagonal. Nevertheless, it is 
simple to see that high segmentation accuracy can be achieved 
by using the above affinity matrix to do spectral clustering. 



The only vector norm used is the $\ell_2$ norm, denoted by $\|\cdot\|_2$. 
A variety of norms on matrices will be used. The matrix $\ell_0$, 
$\ell_{2,0}$, $\ell_1$ and $\ell_{2,1}$ norms are defined by 
$\|M\|_0 = \#\{(i,j) : [M]_{ij} \neq 0\}$, 
$\|M\|_{2,0} = \#\{i : \|[M]_{:,i}\|_2 \neq 0\}$, 
$\|M\|_1 = \sum_{i,j} |[M]_{ij}|$ 
and $\|M\|_{2,1} = \sum_i \|[M]_{:,i}\|_2$, respectively. The matrix $\ell_\infty$ 
norm is defined as $\|M\|_\infty = \max_{i,j} |[M]_{ij}|$. The spectral 
norm of a matrix $M$ is denoted by $\|M\|$, i.e., $\|M\|$ is the 
largest singular value of $M$. The Frobenius norm and the 
nuclear norm (the sum of singular values of a matrix) are 
denoted by $\|M\|_F$ and $\|M\|_*$, respectively. The Euclidean 
inner product between two matrices is $\langle M, N \rangle = \mathrm{tr}(M^T N)$, 
where $M^T$ is the transpose of a matrix and $\mathrm{tr}(\cdot)$ is the trace 
of a matrix. 
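
As an illustration of the definitions above, the following NumPy sketch evaluates each norm on a small matrix. The helper name `matrix_norms` and the tolerance `tol` used to decide whether an entry counts as nonzero are assumptions made for this example, not part of the paper.

```python
import numpy as np


def matrix_norms(M, tol=1e-12):
    """Evaluate the matrix norms defined in the notation section.

    `tol` is an illustrative threshold for treating an entry as nonzero.
    """
    col_l2 = np.linalg.norm(M, axis=0)          # l2 norm of every column
    return {
        "l0": int(np.sum(np.abs(M) > tol)),      # number of nonzero entries
        "l20": int(np.sum(col_l2 > tol)),        # number of nonzero columns
        "l1": float(np.sum(np.abs(M))),          # sum of absolute values
        "l21": float(np.sum(col_l2)),            # sum of column l2 norms
        "linf": float(np.max(np.abs(M))),        # largest absolute entry
        "spectral": float(np.linalg.norm(M, 2)),     # largest singular value
        "fro": float(np.linalg.norm(M, "fro")),      # Frobenius norm
        "nuclear": float(np.linalg.norm(M, "nuc")),  # sum of singular values
    }


M = np.array([[3.0, 0.0, 0.0],
              [4.0, 0.0, 2.0]])
N = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
norms = matrix_norms(M)

# Euclidean inner product <M, N> = tr(M^T N), equal to the entrywise sum.
inner = float(np.trace(M.T @ N))
```

The columns of `M` have $\ell_2$ norms 5, 0 and 2, so the $\ell_{2,0}$ norm counts two nonzero columns while the $\ell_{2,1}$ norm sums to 7.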

The supports of a matrix $M$ are the indices of its nonzero 
entries, i.e., $\{(i,j) : [M]_{ij} \neq 0\}$. Similarly, its column 
supports are the indices of its nonzero columns. The symbol 
$\mathcal{I}$ (superscripts, subscripts, etc.) is used to denote the column 
supports of a matrix, i.e., $\mathcal{I} = \{(i) : \|[M]_{:,i}\|_2 \neq 0\}$. The 
corresponding complement set (i.e., zero columns) is $\mathcal{I}^c$. There 
are two projection operators associated with $\mathcal{I}$ and $\mathcal{I}^c$: $\mathcal{P}_{\mathcal{I}}$ and 
$\mathcal{P}_{\mathcal{I}^c}$. While applying them to a matrix $M$, the matrix $\mathcal{P}_{\mathcal{I}}(M)$ 
(resp. $\mathcal{P}_{\mathcal{I}^c}(M)$) is obtained from $M$ by setting $[M]_{:,i}$ to zero 
for all $i \notin \mathcal{I}$ (resp. $i \notin \mathcal{I}^c$). 

We also adopt the conventions of using $\mathrm{span}(M)$ to denote 
the linear space spanned by the columns of a matrix $M$, using 
$y \in \mathrm{span}(M)$ to denote that a vector $y$ belongs to the space 
$\mathrm{span}(M)$, and using $Y \in \mathrm{span}(M)$ to denote that all column 
vectors of $Y$ belong to $\mathrm{span}(M)$. 

Finally, in this paper we use several terminologies, includ- 
ing "block-diagonal matrix", "union and sum of subspaces", 
"independent (and disjoint) subspaces", "full SVD and skinny 
SVD", "pseudoinverse", "column space and row space" and 
"affinity degree". These terminologies are defined in Ap- 
pendix. 






B. Relations Between Segmentation and Row Space 

Let $X_0$ with skinny SVD $U_0 \Sigma_0 V_0^T$ be a collection of data 
samples strictly drawn from a union of multiple subspaces 
(i.e., $X_0$ is clean). Then the subspace membership of the samples 
is determined by the row space of $X_0$. Indeed, as shown 
in [12], when subspaces are independent, $V_0V_0^T$ forms a 
block-diagonal matrix: the $(i,j)$-th entry of $V_0V_0^T$ can be 
nonzero only if the $i$-th and $j$-th samples are from the same 
subspace. Hence, this matrix, termed the Shape Interaction 
Matrix (SIM) [12], has been widely used for subspace segmentation. 
Previous approaches simply compute the SVD of 
the data matrix $X = U_X \Sigma_X V_X^T$ and then use $|V_X V_X^T|$ for 
subspace segmentation. However, in the presence of outliers 
and corruptions, $V_X$ can be far away from $V_0$ and thus the 
segmentation using such approaches is inaccurate. In contrast, 
we show that LRR can recover $V_0V_0^T$ even when the data 
matrix $X$ is contaminated by outliers. 
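
The block-diagonal structure of the SIM for clean data from independent subspaces can be checked numerically. The sketch below is only an illustration under assumed settings (the ambient dimension, subspace dimensions, sample counts, random seed and rank threshold are all choices made for this example): it draws clean samples from three random subspaces, forms $V_0V_0^T$ from the skinny SVD, and measures the largest between-subspace entry.

```python
import numpy as np

rng = np.random.default_rng(0)
d, dims, per = 30, [3, 4, 2], 10   # ambient dim, subspace dims, samples each

# Clean samples: each block lies exactly in one random subspace. Random
# subspaces whose dimensions sum to less than d are independent almost surely.
blocks = []
for r in dims:
    basis = rng.standard_normal((d, r))
    blocks.append(basis @ rng.standard_normal((r, per)))
X0 = np.hstack(blocks)
labels = np.repeat(np.arange(len(dims)), per)

# Skinny SVD: keep only the numerically nonzero singular values.
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
r0 = int(np.sum(s > 1e-10 * s[0]))
V0 = Vt[:r0].T

sim = V0 @ V0.T                     # shape interaction matrix V0 V0^T

# Largest magnitude among entries linking samples of different subspaces.
different = labels[:, None] != labels[None, :]
off_block = float(np.abs(sim[different]).max())
```

In this clean, independent setting `off_block` is at the level of floating-point round-off, so thresholding `np.abs(sim)` would already separate the three groups.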

If the subspaces are not independent, $V_0V_0^T$ may not be 
strictly block-diagonal. This is indeed well expected, since 
when the subspaces have nonzero (nonempty) intersections, 
some samples may belong to multiple subspaces simultaneously. 
When the subspaces are pairwise disjoint (but not 
independent), our extensive numerical experiments show that 
$V_0V_0^T$ may still be close to block-diagonal, as exemplified 
in Fig. 3. Hence, recovering $V_0V_0^T$ is still of interest to 
subspace segmentation. 

C. Problem Statement 

Problem 1.1 only roughly describes what we want to study. 
More precisely, this paper addresses the following problem. 

Problem 3.1 (Subspace Clustering): Let $X_0 \in \mathbb{R}^{d \times n}$ with 
skinny SVD $U_0 \Sigma_0 V_0^T$ store a set of $n$ $d$-dimensional samples 
(vectors) strictly drawn from a union of $k$ subspaces $\{\mathcal{S}_i\}_{i=1}^{k}$ 
of unknown dimensions ($k$ is also unknown). Given a set of 
observation vectors $X$ generated by 

$$X = X_0 + E_0,$$ 

the goal is to recover the row space of $X_0$, or equivalently to recover the 
true SIM $V_0V_0^T$. 

The recovery of row space can guarantee high segmentation 
accuracy, as analyzed in Section III-B. Also, the recovery of 
row space naturally implies the success in error correction. So 
it is sufficient to set the goal of subspace clustering as the 
recovery of the row space identified by $V_0V_0^T$. For ease of 
exploration, we consider the problem under three assumptions 
of increasing practicality and difficulty. 

Assumption 1: The data is clean, i.e., $E_0 = 0$. 

Assumption 2: A fraction of the data samples are grossly 
corrupted and the others are clean, i.e., $E_0$ has sparse column 
supports as shown in Fig. 2(c). 

Assumption 3: A fraction of the data samples are grossly 
corrupted and the others are contaminated by small Gaussian 
noise, i.e., $E_0$ is characterized by a combination of the models 
shown in Fig. 2(a) and Fig. 2(c). 

1 For a matrix $M$, $|M|$ denotes the matrix with the $(i,j)$-th entry being the 
absolute value of $[M]_{ij}$. 



Unlike [14], the independence assumption on the subspaces is 
not highlighted in this paper, because the analysis in this work 
focuses on recovering $V_0V_0^T$ rather than on pursuing a 
block-diagonal matrix. 

IV. Low-Rank Representation for Matrix 
Recovery 

In this section we abstractly present the LRR method 
for recovering a matrix from corrupted observations. The 
basic theorems and optimization algorithms will be presented. 
The specific methods and theories for handling the subspace 
clustering problem are deferred until Section V. 

A. Low-Rank Representation 

In order to recover the low-rank matrix $X_0$ from the given 
observation matrix $X$ corrupted by errors $E_0$ ($X = X_0 + E_0$), 
it is straightforward to consider the following regularized rank 
minimization problem: 

$$\min_{D,E} \ \mathrm{rank}(D) + \lambda \|E\|_{\ell}, \quad \text{s.t.} \quad X = D + E, \qquad (2)$$ 

where $\lambda > 0$ is a parameter and $\|\cdot\|_{\ell}$ indicates a certain 
regularization strategy, such as the squared Frobenius norm 
(i.e., $\|\cdot\|_F^2$) used for modeling the noise as shown in Fig. 2(a) 
[6], the $\ell_0$ norm adopted by [7] for characterizing the random 
corruptions as shown in Fig. 2(b), and the $\ell_{2,0}$ norm adopted 
by Q3L lfl6ll for dealing with sample- specific corruptions 
and outliers. Suppose D* is a minimizer with respect to the 
variable D, then it gives a low-rank recovery to the original 
data Xo. 

The above formulation is adopted by the recently established 
Robust PCA (RPCA) method [7], which has been used to 
achieve the state-of-the-art performance in several applications 
(e.g., [35]). However, this formulation implicitly assumes that 
the underlying data structure is a single low-rank subspace. 
When the data is drawn from a union of multiple subspaces, 
denoted as $\mathcal{S}_1, \mathcal{S}_2, \cdots, \mathcal{S}_k$, it actually treats the data as being 
sampled from a single subspace defined by $\mathcal{S} = \sum_{i=1}^{k} \mathcal{S}_i$. 
Since the sum $\sum_{i=1}^{k} \mathcal{S}_i$ can be much larger than the union 
$\cup_{i=1}^{k} \mathcal{S}_i$, the specifics of the individual subspaces are not well 
considered and so the recovery may be inaccurate. 

To better handle the mixed data, here we suggest a more 
general rank minimization problem defined as follows: 

$$\min_{Z,E} \ \mathrm{rank}(Z) + \lambda \|E\|_{\ell}, \quad \text{s.t.} \quad X = AZ + E, \qquad (3)$$ 

where $A$ is a "dictionary" that linearly spans the data space. 
We call the minimizer $Z^*$ (with regard to the variable $Z$) 
the "lowest-rank representation" of data $X$ with respect to a 
dictionary $A$. After obtaining an optimal solution $(Z^*, E^*)$, 
we could recover the original data by using $AZ^*$ (or $X - E^*$). 
Since $\mathrm{rank}(AZ^*) \leq \mathrm{rank}(Z^*)$, $AZ^*$ is also a low-rank 
recovery to the original data $X_0$. By setting $A = I$, the 
formulation (3) falls back to (2). So LRR could be regarded 
as a generalization of RPCA that essentially uses the standard 
bases as the dictionary. By choosing an appropriate dictionary 
$A$, as we will see, the lowest-rank representation can recover 
the underlying row space so as to reveal the true segmentation 
of data. So, LRR could handle well the data drawn from a 
union of multiple subspaces. 

B. Analysis on the LRR Problem 

The optimization problem (3) is difficult to solve due to the 
discrete nature of the rank function. For ease of exploration, 
we begin with the "ideal" case that the data is clean. That is, 
we consider the following rank minimization problem: 

$$\min_{Z} \ \mathrm{rank}(Z), \quad \text{s.t.} \quad X = AZ. \qquad (4)$$ 

It is easy to see that the solution to (4) may not be unique. 
As a common practice in rank minimization problems, we 
replace the rank function with the nuclear norm, resulting in 
the following convex optimization problem: 

$$\min_{Z} \ \|Z\|_*, \quad \text{s.t.} \quad X = AZ. \qquad (5)$$ 

We will show that the solution to (5) is also a solution to (4) 
and this special solution is useful for subspace segmentation. 

In the following, we shall show some general properties of 
the minimizer to problem (5). These general conclusions form 
the foundations of LRR (the proofs can be found in Appendix). 

1) Uniqueness of the Minimizer: The nuclear norm is 
convex, but not strongly convex. So it is possible that problem 
(5) has multiple optimal solutions. Fortunately, it can be 
proven that the minimizer to problem (5) is always uniquely 
defined by a closed form. This is summarized in the following 
theorem. 

Theorem 4.1: Assume $A \neq 0$ and $X = AZ$ has feasible 
solution(s), i.e., $X \in \mathrm{span}(A)$. Then 

$$Z^* = A^{\dagger} X \qquad (6)$$ 

is the unique minimizer to problem (5), where $A^{\dagger}$ is the 
pseudoinverse of $A$. 

From the above theorem, we have the following corollary 
which shows that problem (5) is a good surrogate of problem 
(4). 

Corollary 4.1: Assume $A \neq 0$ and $X = AZ$ has feasible 
solutions. Let $Z^*$ be the minimizer to problem (5), then 
$\mathrm{rank}(Z^*) = \mathrm{rank}(X)$ and $Z^*$ is also a minimal rank solution 
to problem (4). 
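
Theorem 4.1 and Corollary 4.1 can be sanity-checked numerically. The following sketch is only an illustration under assumed settings (sizes, seed and the cutoff `rcond=1e-10` are choices made here): it builds a rank-deficient dictionary $A$, a data matrix $X \in \mathrm{span}(A)$, and compares $Z^* = A^{\dagger}X$ with another feasible solution obtained by adding a component from the null space of $A$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n, r = 20, 15, 12, 5

# Rank-5 dictionary A (d x m) and data X lying in span(A).
A = rng.standard_normal((d, r)) @ rng.standard_normal((r, m))
X = A @ rng.standard_normal((m, n))

# Closed-form minimizer of the nuclear norm problem: Z* = pinv(A) X.
# An explicit cutoff avoids inverting numerically zero singular values.
Z_star = np.linalg.pinv(A, rcond=1e-10) @ X

# Another feasible solution: add any matrix whose columns lie in null(A).
_, _, Vt = np.linalg.svd(A)                 # full SVD, Vt is m x m
null_basis = Vt[r:].T                       # m x (m - r), A @ null_basis ~ 0
Z_other = Z_star + null_basis @ rng.standard_normal((m - r, n))

nuc_star = np.linalg.norm(Z_star, "nuc")
nuc_other = np.linalg.norm(Z_other, "nuc")
rank_Z = np.linalg.matrix_rank(Z_star, tol=1e-8)
rank_X = np.linalg.matrix_rank(X, tol=1e-8)
```

Both `Z_star` and `Z_other` satisfy $X = AZ$, but `nuc_star` is the smaller nuclear norm, and `rank_Z` equals `rank_X`, in line with the corollary.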

2) Block-Diagonal Property of the Minimizer: By choosing 
an appropriate dictionary, the lowest-rank representation 
can reveal the true segmentation results. Namely, when the 
columns of $A$ and $X$ are exactly sampled from independent 
subspaces, the minimizer to problem (5) can reveal the subspace 
membership among the samples. Let $\{\mathcal{S}_1, \mathcal{S}_2, \cdots, \mathcal{S}_k\}$ 
be a collection of $k$ subspaces, each of which has a rank 
(dimension) of $r_i > 0$. Also, let $A = [A_1, A_2, \cdots, A_k]$ and 
$X = [X_1, X_2, \cdots, X_k]$. Then we have the following theorem. 

Theorem 4.2: Without loss of generality, assume that $A_i$ 
is a collection of $m_i$ samples of the $i$-th subspace $\mathcal{S}_i$, $X_i$ 
is a collection of $n_i$ samples from $\mathcal{S}_i$, and the sampling of 
each $A_i$ is sufficient such that $\mathrm{rank}(A_i) = r_i$ (i.e., $A_i$ can be 
regarded as the bases that span the subspace). If the subspaces 
are independent, then the minimizer to problem (5) is block-diagonal: 

$$Z^* = \begin{bmatrix} Z_1^* & 0 & \cdots & 0 \\ 0 & Z_2^* & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_k^* \end{bmatrix},$$ 

where $Z_i^*$ is an $m_i \times n_i$ coefficient matrix with $\mathrm{rank}(Z_i^*) = 
\mathrm{rank}(X_i), \forall i$. 

Note that the claim of $\mathrm{rank}(Z_i^*) = \mathrm{rank}(X_i)$ guarantees 
the high within-class homogeneity of $Z_i^*$, since the low-rank 
property generally requires $Z_i^*$ to be dense. This is different 
from SR, which is prone to produce a "trivial" solution if 
$A = X$, because the sparsest representation is an identity 
matrix in this case. It is also worth noting that the above 
block-diagonal property does not require the data samples to have been 
grouped together according to their subspace memberships. 
There is no loss of generality in assuming that the indices of 
the samples have been rearranged to satisfy the true subspace 
memberships, because the solution produced by LRR is globally 
optimal and does not depend on the arrangement of the 
data samples. 
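
The block-diagonal property of Theorem 4.2 can be observed directly from the closed form $Z^* = A^{\dagger}X$ of Theorem 4.1. The sketch below is a minimal illustration with assumed sizes and seed: the dictionary blocks $A_i$ and the data blocks $X_i$ are drawn from the same three random (hence, almost surely independent) subspaces.

```python
import numpy as np

rng = np.random.default_rng(2)
d, dims, m_i, n_i = 40, [2, 3, 4], 8, 6   # illustrative sizes

A_blocks, X_blocks = [], []
for r in dims:
    basis = rng.standard_normal((d, r))                   # basis of S_i
    A_blocks.append(basis @ rng.standard_normal((r, m_i)))  # dictionary A_i
    X_blocks.append(basis @ rng.standard_normal((r, n_i)))  # data X_i
A = np.hstack(A_blocks)
X = np.hstack(X_blocks)

# Minimizer of the clean-data nuclear norm problem (Theorem 4.1).
Z = np.linalg.pinv(A, rcond=1e-10) @ X

# Rows of Z index dictionary atoms, columns index data samples.
row_lab = np.repeat(np.arange(len(dims)), m_i)
col_lab = np.repeat(np.arange(len(dims)), n_i)
off_block = float(np.abs(Z[row_lab[:, None] != col_lab[None, :]]).max())

# Rank of the first diagonal block, expected to equal rank(X_1).
rank_first = np.linalg.matrix_rank(Z[:m_i, :n_i], tol=1e-8)
```

Here `off_block` is at round-off level, and the first diagonal block has rank 2, the dimension of the first subspace.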

C. Recovering Low-Rank Matrices by Convex Optimization 

Corollary 4.1 suggests that it is appropriate to use the 
nuclear norm as a surrogate to replace the rank function in 
problem (3). Also, the matrix $\ell_1$ and $\ell_{2,1}$ norms are good 
relaxations of the $\ell_0$ and $\ell_{2,0}$ norms, respectively. So we could 
obtain a low-rank recovery to $X_0$ by solving the following 
convex optimization problem: 

$$\min_{Z,E} \ \|Z\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = AZ + E. \qquad (7)$$ 

Here, the $\ell_{2,1}$ norm is adopted to characterize the error term $E$, 
since we want to model the sample-specific corruptions (and 
outliers) as shown in Fig. 2(c). For the small Gaussian noise 
as shown in Fig. 2(a), $\|E\|_F^2$ should be chosen; for the random 
corruptions as shown in Fig. 2(b), $\|E\|_1$ is an appropriate 
choice. After obtaining the minimizer $(Z^*, E^*)$, we could use 
$AZ^*$ (or $X - E^*$) to obtain a low-rank recovery to the original 
data $X_0$. 

The optimization problem (7) is convex and can be solved 
by various methods. For efficiency, we adopt in this paper the 
Augmented Lagrange Multiplier (ALM) [36], [37] method. We 
first convert (7) to the following equivalent problem: 

$$\min_{Z,E,J} \ \|J\|_* + \lambda \|E\|_{2,1}, \quad \text{s.t.} \quad X = AZ + E, \ Z = J.$$ 

This problem can be solved by the ALM method, which 
minimizes the following augmented Lagrange function: 

$$\mathcal{L} = \|J\|_* + \lambda \|E\|_{2,1} + \mathrm{tr}\left(Y_1^T (X - AZ - E)\right) + \mathrm{tr}\left(Y_2^T (Z - J)\right) + \frac{\mu}{2}\left(\|X - AZ - E\|_F^2 + \|Z - J\|_F^2\right).$$ 

The above problem is unconstrained. So it can be minimized 
with respect to $J$, $Z$ and $E$, respectively, by fixing the 
other variables, and then updating the Lagrange multipliers 
$Y_1$ and $Y_2$, where $\mu > 0$ is a penalty parameter. The inexact 
ALM method, also called the alternating direction method, 
Algorithm 1 Solving Problem (7) by Inexact ALM 
Input: data matrix $X$, parameter $\lambda$. 
Initialize: $Z = J = 0$, $E = 0$, $Y_1 = 0$, $Y_2 = 0$, $\mu = 10^{-6}$, 
$\mu_{\max} = 10^{6}$, $\rho = 1.1$, and $\varepsilon = 10^{-8}$. 
while not converged do 

1. fix the others and update $J$ by 

$J = \arg\min_J \frac{1}{\mu}\|J\|_* + \frac{1}{2}\|J - (Z + Y_2/\mu)\|_F^2.$ 

2. fix the others and update $Z$ by 

$Z = (I + A^T A)^{-1}\left(A^T (X - E) + J + (A^T Y_1 - Y_2)/\mu\right).$ 

3. fix the others and update $E$ by 

$E = \arg\min_E \frac{\lambda}{\mu}\|E\|_{2,1} + \frac{1}{2}\|E - (X - AZ + Y_1/\mu)\|_F^2.$ 

4. update the multipliers 

$Y_1 = Y_1 + \mu (X - AZ - E),$ 
$Y_2 = Y_2 + \mu (Z - J).$ 

5. update the parameter $\mu$ by $\mu = \min(\rho\mu, \mu_{\max})$. 

6. check the convergence conditions: 

$\|X - AZ - E\|_\infty < \varepsilon$ and $\|Z - J\|_\infty < \varepsilon$. 
end while 



is outlined in Algorithm 1.² Note that although Step 1 and 
Step 3 of the algorithm are convex problems, they both have 
closed-form solutions. Step 1 is solved via the Singular Value 
Thresholding (SVT) operator [38], while Step 3 is solved via 
the following lemma: 

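Lemma 4.1 ([39]): Let $Q$ be a given matrix. If the optimal 
solution to 

$$\min_{W} \ \alpha \|W\|_{2,1} + \frac{1}{2}\|W - Q\|_F^2$$ 

is $W^*$, then the $i$-th column of $W^*$ is 

$$[W^*]_{:,i} = \begin{cases} \dfrac{\|[Q]_{:,i}\|_2 - \alpha}{\|[Q]_{:,i}\|_2}\,[Q]_{:,i}, & \text{if } \|[Q]_{:,i}\|_2 > \alpha, \\ 0, & \text{otherwise.} \end{cases}$$ 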
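
The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `svt`, `column_shrink` and `lrr_inexact_alm`, the iteration cap `max_iter`, and the synthetic data at the end are assumptions made for this example. Step 1 uses singular value thresholding and Step 3 uses the column-wise shrinkage of Lemma 4.1.

```python
import numpy as np


def svt(M, tau):
    """Singular value thresholding: argmin_J tau*||J||_* + 0.5*||J - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def column_shrink(Q, alpha):
    """Column-wise shrinkage (Lemma 4.1): argmin_W alpha*||W||_{2,1} + 0.5*||W - Q||_F^2."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.where(norms > alpha,
                     (norms - alpha) / np.maximum(norms, 1e-300), 0.0)
    return Q * scale


def lrr_inexact_alm(X, A, lam, mu=1e-6, mu_max=1e6, rho=1.1, eps=1e-8,
                    max_iter=1000):
    """Inexact ALM for min ||Z||_* + lam*||E||_{2,1} s.t. X = AZ + E.

    `max_iter` is an illustrative safeguard that is not part of Algorithm 1.
    """
    d, n = X.shape
    m = A.shape[1]
    Z = np.zeros((m, n))
    J = np.zeros((m, n))
    E = np.zeros((d, n))
    Y1 = np.zeros((d, n))
    Y2 = np.zeros((m, n))
    inv_term = np.linalg.inv(np.eye(m) + A.T @ A)   # constant across iterations
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)                                 # Step 1
        Z = inv_term @ (A.T @ (X - E) + J + (A.T @ Y1 - Y2) / mu)      # Step 2
        E = column_shrink(X - A @ Z + Y1 / mu, lam / mu)               # Step 3
        R1 = X - A @ Z - E
        R2 = Z - J
        Y1 = Y1 + mu * R1                                              # Step 4
        Y2 = Y2 + mu * R2
        mu = min(rho * mu, mu_max)                                     # Step 5
        if np.abs(R1).max() < eps and np.abs(R2).max() < eps:          # Step 6
            break
    return Z, E


# Illustrative usage: 30 samples from a 3-dimensional subspace, two of which
# are replaced by outliers, with the data itself used as the dictionary.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 30))
X[:, :2] = 5.0 * rng.standard_normal((20, 2))
Z, E = lrr_inexact_alm(X, X, lam=0.5)
residual = float(np.abs(X - X @ Z - E).max())
```

Precomputing $(I + A^TA)^{-1}$ once is a design choice that keeps Step 2 to matrix products, since that matrix does not change across iterations. The value `lam=0.5` is arbitrary here; the column norms of the returned `E` can be inspected to see which samples the model treats as corrupted.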
1) Convergence Properties: When the objective function 
is smooth, the convergence of the exact ALM algorithm has 
been generally proven in [37]. For inexact ALM, which is a 
variation of exact ALM, its convergence has also been well 
studied when the number of blocks is at most two [36], 
[40]. Up to present, it is still difficult to generally ensure 
the convergence of inexact ALM with three or more blocks 
[40]. Since there are three blocks (including $Z$, $J$ and $E$) in 

2 To solve the problem $\min_{Z,E} \|Z\|_* + \lambda \|E\|_1$, s.t. $X = AZ + E$, one 
only needs to replace Step 3 of Algorithm 1 by $E = \arg\min_E \frac{\lambda}{\mu}\|E\|_1 + 
\frac{1}{2}\|E - (X - AZ + Y_1/\mu)\|_F^2$, which is solved by using the shrinkage 
operator [36]. 

Also, please note here that the setting of $\varepsilon = 10^{-8}$ is based on the 
assumption that the values in $X$ have been normalized within the range of 
0 to 1. 



Algorithm 1 and the objective function of (7) is not smooth, 
it would not be easy to prove the convergence in theory. 

Fortunately, there actually exist some guarantees for ensuring 
the convergence of Algorithm 1. According to the 
theoretical results in [41], two conditions are sufficient (but 
may not be necessary) for Algorithm 1 to converge: the first 
condition is that the dictionary matrix $A$ is of full column 
rank; the second one is that the optimality gap produced in 
each iteration step is monotonically decreasing, namely the 
error 

$$\epsilon_k = \|(Z_k, J_k) - \arg\min_{Z,J} \mathcal{L}\|_F^2$$ 

is monotonically decreasing, where $Z_k$ (resp. $J_k$) denotes the 
solution produced at the $k$-th iteration, and $\arg\min_{Z,J} \mathcal{L}$ indicates 
the "ideal" solution obtained by minimizing the Lagrange 
function $\mathcal{L}$ with respect to both $Z$ and $J$ simultaneously. 
The first condition is easy to obey, since problem (7) can be 
converted into an equivalent problem where the full column 
rank condition is always satisfied (we will show this in the 
next subsection). For the monotonically decreasing condition, 
although it is not easy to strictly prove it, the convexity of the 
Lagrange function could guarantee its validity to some extent 
[41]. So, it could be well expected that Algorithm 1 has good 
convergence properties. Moreover, inexact ALM is known to 
generally perform well in reality, as illustrated in Boll . 

That $\mu$ should be upper bounded (Step 5 of Algorithm 1) is 
required by the traditional theory of the alternating direction 
method in order to guarantee the convergence of the algorithm. 
So we also adopt this convention. Nevertheless, please note 
that the upper boundedness may not be necessary for some 
particular problems, e.g., the RPCA problem as analyzed in 

EE 

2) Computational Complexity: For ease of analysis, we 
assume that the sizes of both $A$ and $X$ are $d \times n$ in the 
following. The major computation of Algorithm 1 is Step 1, 
which requires computing the SVD of an $n \times n$ matrix. So 
it will be time consuming if $n$ is large, i.e., the number of 
data samples is large. Fortunately, the computational cost of 
LRR can be easily reduced by the following theorem, which 
follows from Theorem 4.1. 

Theorem 4.3: For any optimal solution $(Z^*, E^*)$ to the
LRR problem (7), we have that

$$Z^* \in \mathrm{span}(A^T).$$

The above theorem concludes that the optimal solution $Z^*$
(with respect to the variable $Z$) to (7) always lies within the
subspace spanned by the rows of $A$. This means that $Z^*$ can
be factorized into $Z^* = P^*\tilde{Z}^*$, where $P^*$ can be computed
in advance by orthogonalizing the columns of $A^T$. Hence,
problem (7) can be equivalently transformed into a simpler
problem by replacing $Z$ with $P^*\tilde{Z}$:

$$\min_{\tilde{Z},E} \|\tilde{Z}\|_* + \lambda\|E\|_{2,1}, \quad \text{s.t.}\ X = B\tilde{Z} + E,$$

where $B = AP^*$. After obtaining a solution $(\tilde{Z}^*, E^*)$ to the
above problem, the optimal solution to (7) is recovered by
$(P^*\tilde{Z}^*, E^*)$. Since the number of rows of $\tilde{Z}$ is at most $r_A$ (the
rank of $A$), the above problem can be solved with a complexity
of $O(dnr_A + nr_A^2 + r_A^3)$ by using Algorithm 1. So LRR is quite
scalable for large-size ($n$ is large) datasets, provided that a low-rank
dictionary $A$ has been obtained. When using $A = X$, the
computational complexity is at most $O(d^2n + d^3)$ (assuming
$d < n$). This is also fast provided that the data dimension $d$
is not high.
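The reduction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `reduce_dictionary` and `lift_solution` and the rank tolerance `tol` are hypothetical and not part of the paper, and the orthogonalization is done here with an SVD (any orthonormal basis of $\mathrm{span}(A^T)$ would serve).

```python
import numpy as np

def reduce_dictionary(A, tol=1e-10):
    """Return (P, B): P is an orthonormal basis of span(A^T), B = A @ P."""
    # The right singular vectors of A with nonzero singular values
    # span the row space of A, i.e. span(A^T).
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    r_A = int(np.sum(s > tol * s[0]))  # numerical rank of A (assumed tolerance)
    P = Vt[:r_A].T                     # n x r_A, orthonormal columns
    B = A @ P                          # d x r_A reduced dictionary
    return P, B

def lift_solution(P, Z_tilde):
    """Map a solution of the reduced problem back to Z = P @ Z_tilde."""
    return P @ Z_tilde
```

Because $P^*$ has orthonormal columns, the nuclear norm of $P^*\tilde{Z}$ equals that of $\tilde{Z}$, which is why the reduced problem keeps the same objective while working with an $r_A \times n$ variable instead of an $n \times n$ one.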

When the cost of orthogonalization and the number of
iterations needed to converge are taken into account, the
complexity of Algorithm 1 is

$$O(d^2n) + O(n_s(dnr_A + nr_A^2 + r_A^3)),$$

where $n_s$ is the number of iterations. The iteration number $n_s$
depends on the choice of $\rho$: $n_s$ is smaller when $\rho$ is larger, and
vice versa. Although a larger $\rho$ does produce higher efficiency,
using a large $\rho$ risks losing optimality [36]. In our
experiments, we always set $\rho = 1.1$. Under this setting, the
iteration number usually lies within the range of 50 to 300.

V. Subspace Clustering by LRR 

In this section, we utilize LRR to address Problem 3.1,
which is to recover the original row space from a set of corrupted
observations. Both theoretical and experimental results
will be presented.

A. Exactness to Clean Data 

When there are no errors in data, i.e., $X = X_0$ and $E_0 = 0$,
it is simple to show that the row space (identified by $V_0V_0^T$)
of $X_0$ is exactly recovered by solving the following nuclear
norm minimization problem:

$$\min_{Z} \|Z\|_*, \quad \text{s.t.}\ X = XZ, \qquad (8)$$

which is to choose the data matrix $X$ itself as the dictionary
in (5). By Theorem 4.1, we have the following theorem, which
has also been proven by Wei and Lin [42].

Theorem 5.1: Suppose the skinny SVD of $X$ is $U\Sigma V^T$;
then the minimizer to problem (8) is uniquely defined by

$$Z^* = VV^T.$$

This naturally implies that $Z^*$ exactly recovers $V_0V_0^T$ when
$X$ is clean (i.e., $E_0 = 0$).
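As a small numerical illustration of Theorem 5.1, the closed-form minimizer can be computed directly from the skinny SVD. The sketch below is not the authors' code: the function name `shape_interaction_matrix`, the tolerance `tol` used to decide the numerical rank, and the toy data (two independent 2-dimensional subspaces in 10 dimensions, 5 samples each) are assumptions made only for this example.

```python
import numpy as np

def shape_interaction_matrix(X, tol=1e-10):
    """Closed-form minimizer Z* = V V^T of min ||Z||_* s.t. X = XZ."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))  # numerical rank defines the skinny SVD
    V = Vt[:r].T                     # n x r right singular vectors
    return V @ V.T

# Toy clean data: samples 0-4 come from one subspace, samples 5-9 from another.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 5))
X2 = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 5))
X = np.hstack([X1, X2])
Z = shape_interaction_matrix(X)
```

On such clean data `X @ Z` reproduces `X`, and the entries of `Z` linking samples from different subspaces vanish up to rounding, which is the block-diagonal structure that the segmentation step relies on.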

The above theorem reveals the connection between LRR and
the method in [12], which is a counterpart of PCA (referred to
as "PCA" for simplicity). Nevertheless, it is well known that
PCA is fragile to the presence of outliers. In contrast, it can be
proven in theory that LRR exactly recovers the row space of
$X_0$ from data contaminated by outliers, as will be shown
in the next subsection.

B. Robustness to Outliers and Sample-Specific Corruptions

Assumption 2 is to imagine that a fraction of the data
samples are away from the underlying subspaces. This implies
that the error term $E_0$ has sparse column supports. So the
$\ell_{2,1}$ norm is appropriate for characterizing $E_0$. By choosing
$A = X$ in (7), we have the following convex optimization
problem:

$$\min_{Z,E} \|Z\|_* + \lambda\|E\|_{2,1}, \quad \text{s.t.}\ X = XZ + E. \qquad (9)$$




Fig. 4

An example of the matrices $U^*(U^*)^T$ and $E^*$ computed from
the data contaminated by outliers. In a similar way as iflUl ,
we create 5 pairwise disjoint subspaces, each of which is of
dimension 4, and draw 40 samples (with ambient dimension 200)
from each subspace. Then, 50 outliers are randomly generated
from $\mathcal{N}(0, s)$, where the standard deviation $s$ is set to be three
times as large as the averaged magnitudes of the samples. By
choosing $0.16 < \lambda < 0.34$, LRR produces a solution $(Z^*, E^*)$
with the column space of $Z^*$ exactly recovering the row space
of $X_0$, and the column supports of $E^*$ exactly identifying the
indices of the outliers.

The above formulation "seems" questionable, because the
data matrix (which itself can contain errors) is used as the
dictionary for error correction. Nevertheless, as shown in the
following two subsections, $A = X$ is indeed a good choice
for several particular problems (see footnote 3).

1) Exactness to Outliers: When an observed data sample
is far away from the underlying subspaces, a typical regime is
that this sample comes from a model other than subspaces,
and is thus called an outlier (see footnote 4). In this case, the data matrix $X$
contains two parts: one part consists of authentic samples
(denoted by $X_0$) strictly drawn from the underlying subspaces,
and the other part consists of outliers (denoted by $E_0$) that are
not subspace members. To precisely describe this setting, we
need to impose an additional constraint on $X_0$, that is,

$$\mathcal{P}_{\mathcal{I}_0}(X_0) = 0, \qquad (10)$$

where $\mathcal{I}_0$ is the indices of the outliers (i.e., the column supports
of $E_0$). Furthermore, we use $n$ to denote the total number of
data samples in $X$, $\gamma = |\mathcal{I}_0|/n$ the fraction of outliers, and $r_0$
the rank of $X_0$. With these notations, we have the following
theorem, which states that LRR can exactly recover the row
space of $X_0$ and identify the indices of outliers as well.

Theorem 5.2 ([43]): There exists $\gamma^* > 0$ such that LRR
with parameter $\lambda = 3/(7\|X\|\sqrt{\gamma^* n})$ strictly succeeds, as long
as $\gamma < \gamma^*$. Here, the success is in the sense that any minimizer
$(Z^*, E^*)$ to (9) can produce

$$U^*(U^*)^T = V_0V_0^T \quad \text{and} \quad \mathcal{I}^* = \mathcal{I}_0, \qquad (11)$$

where $U^*$ is the column space of $Z^*$, and $\mathcal{I}^*$ is the column
supports of $E^*$.

3 Note that this does not deny the importance of learning the dictionary. 
Indeed, the choice of dictionary is a very important aspect in LRR. We leave 
this as future work. 

4 Precisely, we define an outlier as a data vector that is independent of the
samples drawn from the subspaces [43].



Fig. 5

Two examples of the matrix $U^*(U^*)^T$ computed from the
data corrupted by sample-specific corruptions. (a) The
magnitudes of the corruptions are set to be about 0.7 times as
large as the samples. Considering $|U^*(U^*)^T|$ as an affinity
matrix, the average affinity degree of the corrupted samples
is about 40, which means that the corrupted samples can be
projected back onto their respective subspaces. (b) The
magnitudes of the corruptions are set to be about 3.5 times as
large as the samples. The affinity degrees of the corrupted
samples are all zero, which means that the corrupted samples
are treated as outliers. In these experiments, the data samples
are generated in the same way as in Fig. 4. Then, 10% of the samples
are randomly chosen to be corrupted by additive errors of
Gaussian distribution. For each experiment, the parameter $\lambda$ is
carefully determined such that the column supports of $E^*$
identify the indices of the corrupted samples.

There are several important points to note about the above theorem.
First, although the objective function (9) is not strongly convex
and multiple minimizers may exist, it is proven that any
minimizer is effective for subspace clustering. Second, the
coefficient matrix $Z^*$ itself does not recover $V_0V_0^T$ (notice
that $Z^*$ is usually asymmetric unless $E^* = 0$); it is the
column space of $Z^*$ that recovers the row space of $X_0$. Third,
the performance of LRR is measured by the value of $\gamma^*$ (the
larger, the better), which depends on some data properties such
as the incoherence and the extrinsic rank $r_0$ ($\gamma^*$ is larger when
$r_0$ is lower). For more details, please refer to [43].

Fig. 4 shows some experimental results, which verify the
conclusions of Theorem 5.2. Notice that the parameter setting
$\lambda = 3/(7\|X\|\sqrt{\gamma^* n})$ is based on the condition $\gamma < \gamma^*$ (i.e.,
the outlier fraction is smaller than a certain threshold), which is
just a sufficient (but not necessary) condition for ensuring the
success of LRR. So, in practice (even for synthetic examples)
where $\gamma > \gamma^*$, it is possible that other values of $\lambda$ achieve
better performance.

2) Robustness to Sample-Specific Corruptions: For the phenomenon
that an observed sample is away from the subspaces,
another regime is that this sample is an authentic subspace
member, but grossly corrupted. Usually, such corruptions
only happen on a small fraction of data samples, and are thus called
"sample-specific" corruptions. The modeling of sample-specific
corruptions is the same as that of outliers, because in both
cases $E_0$ has sparse column supports. So the formulation (9)
is still applicable. However, the setting (10) is no longer valid,
and thus LRR may not exactly recover the row space $V_0V_0^T$
in this case. Empirically, the conclusion of $\mathcal{I}^* = \mathcal{I}_0$ still holds



Fig. 6

An example of the matrices $U^*(U^*)^T$ and $E^*$ computed from
the data contaminated by noise, outliers and
sample-specific corruptions. In this experiment, first, we
create 10 pairwise disjoint subspaces (each of which is of
dimension 4) and draw 40 samples (with ambient dimension
2000) from each subspace. Second, we randomly choose 10% of the
samples to be grossly corrupted by large errors. The remaining 90% of the
samples are slightly corrupted by small errors. Finally, as in
Fig. 4, 100 outliers are randomly generated. The total amount
of errors (including noise, sample-specific corruptions and
outliers) is given by $\|E_0\|_F/\|X_0\|_F = 0.63$. By setting $\lambda = 0.3$,
$U^*(U^*)^T$ approximately recovers $V_0V_0^T$ with error
$\|U^*(U^*)^T - V_0V_0^T\|_F/\|V_0V_0^T\|_F = 0.17$, and the column supports
of $E^*$ accurately identify the indices of the outliers and
corrupted samples. In contrast, the recovery error produced
by PCA is 0.66, and that by the RPCA method (using the best
parameters) introduced in fTTH is 0.23.

031, which means that the column supports of E* can identify 
the indices of the corrupted samples. 

While both outliers and sample-specific corruptions are
handled in the same way, a question is how to deal with
the cases where the authentic samples are so heavily corrupted
that they have similar properties to the outliers. If a sample is
heavily corrupted so as to be independent of the underlying
subspaces, it will be treated as an outlier in LRR, as illustrated
in Fig. 5. This is reasonable behavior. For example, it is
appropriate to treat a face image as a non-face outlier if the
image has been corrupted to look like something else.

C. Robustness in the Presence of Noise, Outliers and Sample-Specific Corruptions

When there is noise in the data, the column supports of
$E_0$ are not strictly sparse. Nevertheless, the formulation (9)
is still applicable, because the $\ell_{2,1}$ norm (which is relaxed
from the $\ell_{2,0}$ norm) can handle well the signals that approximately
have sparse column supports. Since all observations may be
contaminated, it is unlikely in theory that the row space $V_0V_0^T$
can be exactly recovered. So we aim for near recovery in this
case. By the triangle inequality of matrix norms, the following
theorem can be simply proven without any assumptions.

Theorem 5.3: Let the size of $X$ be $d \times n$, and the rank of
$X_0$ be $r_0$. For any minimizer $(Z^*, E^*)$ to problem (9) with
$\lambda > 0$, we have

$$\|Z^* - V_0V_0^T\|_F \le \min(d, n) + r_0.$$

5 Unlike an outlier, a corrupted sample is not necessarily independent of the
clean samples.
Algorithm 2 Subspace Segmentation

Input: data matrix $X$, number $k$ of subspaces.

1. obtain the minimizer $Z^*$ to problem (9).

2. compute the skinny SVD $Z^* = U^*\Sigma^*(V^*)^T$.

3. construct an affinity matrix $W$ by (12).

4. use $W$ to perform NCut and segment the data samples
into $k$ clusters.



Algorithm 3 Estimating the Subspace Number $k$
Input: data matrix $X$.

1. compute the affinity matrix $W$ in the same way as in
Algorithm 2.

2. compute the Laplacian matrix $L = I - D^{-1/2}WD^{-1/2}$,
where $D = \mathrm{diag}(\sum_j [W]_{1j}, \cdots, \sum_j [W]_{nj})$.

3. estimate the subspace number by (13).



Fig. 6 demonstrates the performance of LRR in the presence
of noise, outliers and sample-specific corruptions. It can be
seen that the results produced by LRR are quite promising.

One may have noticed that the bound given in the above
theorem is somewhat loose. To obtain a more accurate bound
in theory, one needs to relax the equality constraint of (9) into:

$$\min_{Z,E} \|Z\|_* + \lambda\|E\|_{2,1}, \quad \text{s.t.}\ \|X - XZ - E\|_F \le \xi,$$

where $\xi$ is a parameter for characterizing the amount of the
dense noise (Fig Ha)) possibly existing in data. The above
problem can be solved by ALM, in a similar procedure as
Algorithm 1. However, the above formulation needs to introduce
another parameter $\xi$, and thus we do not further explore it in
this paper.

D. Algorithms for Subspace Segmentation, Model Estimation 
and Outlier Detection 

1) Segmentation with Given Subspace Number: After obtaining
$(Z^*, E^*)$ by solving problem (9), the matrix $U^*(U^*)^T$
that identifies the column space of $Z^*$ is useful for subspace
segmentation. Let the skinny SVD of $Z^*$ be $U^*\Sigma^*(V^*)^T$; we
define an affinity matrix $W$ as follows:

$$[W]_{ij} = ([\tilde{U}\tilde{U}^T]_{ij})^2, \qquad (12)$$

where $\tilde{U}$ is formed by $U^*(\Sigma^*)^{1/2}$ with normalized rows. Here,
to obtain better performance on corrupted data, we assign
each column of $U^*$ a weight by multiplying by $(\Sigma^*)^{1/2}$. Notice
that when the data is clean, $\Sigma^* = I$ and thus this technique
has no effect. The reason for using $(\cdot)^2$
is to ensure that the values of the affinity matrix $W$ are
nonnegative (note that the matrix $\tilde{U}\tilde{U}^T$ can have negative values).
Finally, we could use spectral clustering algorithms such
as Normalized Cuts (NCut) [26] to segment the data samples
into a given number $k$ of clusters. Algorithm 2 summarizes
the whole procedure of performing segmentation by LRR.
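The affinity construction in steps 2 and 3 of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a coefficient matrix `Z` is already available from some LRR solver, and the function name `lrr_affinity` and the rank tolerance `tol` are hypothetical. The final NCut step is not shown.

```python
import numpy as np

def lrr_affinity(Z, tol=1e-10):
    """Affinity W_ij = ([U~ U~^T]_ij)^2, with U~ = U* (Sigma*)^{1/2}, rows normalized."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))       # keep the skinny SVD only (assumed tolerance)
    U_tilde = U[:, :r] * np.sqrt(s[:r])   # weight each column of U* by sqrt(sigma)
    norms = np.linalg.norm(U_tilde, axis=1, keepdims=True)
    # Normalize rows; rows that are exactly zero are left as zero.
    U_tilde = U_tilde / np.maximum(norms, np.finfo(float).tiny)
    return (U_tilde @ U_tilde.T) ** 2     # squaring makes every entry nonnegative
```

The resulting `W` is symmetric with entries in $[0, 1]$, so it can be passed to any spectral clustering routine that accepts a precomputed affinity matrix.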

2) Estimating the Subspace Number k: Although it is
generally challenging to estimate the number of subspaces
(i.e., the number of clusters), it is possible to resolve this model
estimation problem thanks to the block-diagonal structure of the
affinity matrix produced by specific algorithms C2L [44], [45].
When a strictly block-diagonal affinity matrix $W$ is obtained,
the subspace number $k$ can be found by first computing the
normalized Laplacian matrix (denoted as $L$) of $W$, and then
counting the number of zero singular values of $L$. When the
obtained affinity matrix is only nearly block-diagonal (which is the
case in reality), one could predict the subspace number as the
number of singular values smaller than a threshold. Here, we
suggest a soft thresholding approach that outputs the estimated
subspace number $\hat{k}$ by

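$$\hat{k} = n - \mathrm{int}\Big(\sum_{i=1}^{n} f_\tau(\sigma_i)\Big). \qquad (13)$$

Here, $n$ is the total number of data samples, $\{\sigma_i\}_{i=1}^{n}$ are the
singular values of the Laplacian matrix $L$, $\mathrm{int}(\cdot)$ is the function
that outputs the nearest integer of a real number, and $f_\tau(\cdot)$ is
a soft thresholding operator defined as

$$f_\tau(\sigma) = \begin{cases} 1, & \text{if } \sigma \ge \tau, \\ \log_2\big(1 + \frac{\sigma^2}{\tau^2}\big), & \text{otherwise,} \end{cases}$$

where $0 < \tau < 1$ is a parameter. Algorithm 3 summarizes the
whole procedure of estimating the subspace number based on
LRR.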
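A compact sketch of Algorithm 3 is given below for illustration. Several points are assumptions of this sketch rather than statements from the text: the function names, the guard against zero degrees, and the closed form of the soft thresholding operator, taken here as $f_\tau(\sigma) = 1$ for $\sigma \ge \tau$ and $\log_2(1 + \sigma^2/\tau^2)$ otherwise (the two branches agree at $\sigma = \tau$). The default $\tau = 0.08$ is the value reported for the Hopkins 155 experiments.

```python
import numpy as np

def soft_threshold(sigma, tau):
    """Assumed form of f_tau: 1 if sigma >= tau, else log2(1 + sigma^2 / tau^2)."""
    sigma = np.asarray(sigma, dtype=float)
    return np.where(sigma >= tau, 1.0, np.log2(1.0 + sigma**2 / tau**2))

def estimate_subspace_number(W, tau=0.08):
    """Estimate k from an affinity matrix W via the normalized Laplacian."""
    n = W.shape[0]
    deg = W.sum(axis=1)
    # Guard against zero degrees (a choice of this sketch, not of the paper).
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, np.finfo(float).tiny))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    sigma = np.linalg.svd(L, compute_uv=False)
    # Near-zero singular values contribute almost nothing to the sum,
    # so n minus the rounded sum softly counts them.
    return n - int(np.rint(soft_threshold(sigma, tau).sum()))
```

For a strictly block-diagonal `W` with connected blocks, each block contributes exactly one zero singular value of `L`, so the estimate equals the number of blocks.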

3) Outlier Detection: As shown in Theorem 5.2, the minimizer
$E^*$ (with respect to the variable $E$) can be used to detect
the outliers that possibly exist in data. This can be simply done
by finding the nonzero columns of $E^*$, when all or a fraction
of data samples are clean (i.e., Assumption 1 and Assumption
2). For the cases where the learnt $E^*$ only approximately has
sparse column supports, one could use a thresholding strategy;
that is, the $i$-th data vector of $X$ is judged to be an outlier if and
only if

$$\|[E^*]_{:,i}\|_2 > \delta, \qquad (14)$$

where $\delta > 0$ is a parameter.
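The thresholding rule (14) amounts to comparing the $\ell_2$ norm of each column of $E^*$ with $\delta$. The following sketch illustrates this together with the $\ell_{2,1}$ norm (the sum of the column $\ell_2$ norms); the function names `l21_norm` and `detect_outliers` are hypothetical, and the error matrix is assumed to come from some LRR solver.

```python
import numpy as np

def l21_norm(E):
    """l_{2,1} norm: sum of the l2 norms of the columns of E."""
    return float(np.linalg.norm(E, axis=0).sum())

def detect_outliers(E, delta):
    """Indices i whose column norm ||[E]_{:,i}||_2 exceeds the threshold delta."""
    col_norms = np.linalg.norm(E, axis=0)
    return np.flatnonzero(col_norms > delta)
```

The column norms themselves can also serve as outlier scores when a ranking is needed, for instance when computing an ROC curve.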

Since the affinity degrees of the outliers are zero or close
to zero (see Fig. 4 and Fig. 6), the possible outliers
can also be removed by discarding the data samples whose
affinity degrees are smaller than a certain threshold. Such a
strategy is commonly used in spectral-type methods [13], [34].
Generally, the underlying principle of this strategy is essentially
the same as (14). Compared to the strategy of characterizing
the outliers by affinity degrees, there is an advantage of using
$E^*$ to indicate outliers; that is, the formulation (9) can be
easily extended to include more priors, e.g., the multiple visual
features as done in [18], [19].

VI. Experiments 

LRR has been used to achieve state-of-the-art performance
in several applications such as motion segmentation [4], image
segmentation [18], face recognition [4] and saliency detection
[19]. In the experiments of this paper, we shall focus on
analyzing the essential aspects of LRR, in the context of
subspace segmentation and outlier detection.

A. Experimental Data 

1) Hopkins 155: To verify the segmentation performance of
LRR, we adopt for experiments the Hopkins 155 [46] motion
database, which provides an extensive benchmark for testing
TABLE I

Some information about Hopkins 155.

        data dimension   # of data samples   # of subspaces   error level
max     201              556                 3                0.0130
min     31               39                  2                0.0002
mean    59.7             295.7               2.3              0.0009
std.    20.2             140.8               0.5              0.0012




Fig. 7 

Examples of the images in the Yale-Caltech dataset. 



various subspace segmentation algorithms. In Hopkins 155,
there are 156 video sequences along with the features extracted
and tracked in all the frames. Each sequence is a separate dataset
(i.e., data matrix), and so there are in total 156 datasets with
different properties, including the number of subspaces, the
data dimension and the number of data samples. Although
the outliers in the data have been manually removed and the
overall error level is low, some sequences (about 10 sequences)
are grossly corrupted and have notable error levels. Table
I summarizes some information about Hopkins 155. For a
sequence represented as a data matrix $X$, its error level is estimated
by its rank-$r$ approximation: $\|X - U_r\Sigma_rV_r^T\|_F/\|X\|_F$,
where $\Sigma_r$ contains the largest $r$ singular values of $X$, and
$U_r$ (resp. $V_r$) is formed by taking the top $r$ left (resp. right)
singular vectors. Here, we set $r = 4k$ ($k$ is the subspace
number of the sequence), due to the fact that the rank of each
subspace in motion data is at most 4.
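The error-level estimate described above can be computed with a truncated SVD. The sketch below is illustrative only; the function name `error_level` and the argument `dim_per_subspace` (defaulting to 4, as stated for motion data) are assumptions of this example, and the input is assumed to be a nonzero matrix.

```python
import numpy as np

def error_level(X, k, dim_per_subspace=4):
    """Relative residual ||X - U_r S_r V_r^T||_F / ||X||_F with r = dim_per_subspace * k."""
    r = min(dim_per_subspace * k, min(X.shape))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_r = (U[:, :r] * s[:r]) @ Vt[:r]   # best rank-r approximation of X
    return float(np.linalg.norm(X - X_r, 'fro') / np.linalg.norm(X, 'fro'))
```

Equivalently, the value is the square root of the energy in the discarded singular values divided by the total energy, so it is zero for data of rank at most $r$ and grows with the amount of noise.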

2) Yale-Caltech: To test LRR's effectiveness in the presence
of outliers and corruptions, we create a dataset by
combining Extended Yale Database B [47] and Caltech101
[48]. For Extended Yale Database B, we remove the images
pictured under extreme light conditions. Namely, we only use
the images with view directions smaller than 45 degrees and
light source directions smaller than 60 degrees, resulting in
1204 authentic samples approximately drawn from a union of
38 low-rank subspaces (each face class corresponds to a subspace).
For Caltech101, we only select the classes containing
no more than 40 images, resulting in 609 non-face outliers.
Fig. 7 shows some examples of this dataset.

B. Baselines and Evaluation Metrics 

Due to the close connections between PCA and LRR, we 
choose PCA and RPCA methods as the baselines. Moreover, 
some previous subspace segmentation methods are also con- 
sidered. 

1) PCA (i.e., SIM): The PCA method is widely used for
dimension reduction. Actually, it can also be applied to subspace
segmentation and outlier detection as follows: first, we
use SVD to obtain the rank-$r$ ($r$ is a parameter) approximation
of the data matrix $X$, denoted as $X \approx U_r\Sigma_rV_r^T$; second, we
utilize $V_rV_r^T$, which is an estimation of the true SIM $V_0V_0^T$,
for subspace segmentation in a similar way as Algorithm 2 (the
only difference is the estimation of SIM); finally, we compute
$E_r = X - U_r\Sigma_rV_r^T$ and use $E_r$ to detect outliers according
to (14).
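The two quantities used by this PCA baseline, the estimated SIM $V_rV_r^T$ and the residual $E_r$, both come from a single truncated SVD. A minimal sketch follows; the function name `pca_baseline` is hypothetical and the code is an illustration rather than the implementation used in the experiments.

```python
import numpy as np

def pca_baseline(X, r):
    """Return (V_r V_r^T, E_r) from the rank-r truncated SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_r = Vt[:r].T
    sim = V_r @ V_r.T                     # estimate of the SIM (row-space projector)
    E_r = X - (U[:, :r] * s[:r]) @ Vt[:r] # residual of the rank-r approximation
    return sim, E_r
```

The matrix `sim` then plays the role of $U^*(U^*)^T$ when building the affinity matrix, and the column norms of `E_r` can be thresholded in the same way as those of $E^*$.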

2) RPCA: As an improvement over PCA, the robust PCA
(RPCA) methods can also do subspace segmentation and outlier
detection. In this work, we consider two RPCA methods
introduced in [7] and [16], which are based on minimizing

$$\min_{D,E} \|D\|_* + \lambda\|E\|_\ell, \quad \text{s.t.}\ X = D + E.$$

In [7], the $\ell_1$ norm is used to characterize random corruptions,
so the method is referred to as "RPCA$_1$". In [16], the $\ell_{2,1}$ norm is
adopted for detecting outliers, so the method is referred to as "RPCA$_{2,1}$".
The detailed procedures for subspace segmentation and outlier
detection are almost the same as the PCA case above. The only
difference is that $V_r$ is formed from the skinny SVD of $D^*$
(not $X$), which is obtained by solving the above optimization
problem. Note here that the value of $r$ is determined by the
parameter $\lambda$, and thus one only needs to select $\lambda$.

3) SR: LRR has a similar appearance to SR, which has been
applied to subspace segmentation [13]. For fair comparison, in
this work we implement an $\ell_{2,1}$-norm based SR method that
computes an affinity matrix by minimizing

$$\min_{Z,E} \|Z\|_1 + \lambda\|E\|_{2,1}, \quad \text{s.t.}\ X = XZ + E,\ [Z]_{ii} = 0.$$

Here, SR needs to enforce $[Z]_{ii} = 0$ to avoid the trivial
solution $Z = I$. After obtaining a minimizer $(Z^*, E^*)$, we
use $W = |Z^*| + |(Z^*)^T|$ as the affinity matrix to do subspace
segmentation. The procedure of using $E^*$ to perform outlier
detection is the same as for LRR.

4) Some Other Methods: We also consider for comparison
some previous subspace segmentation methods, including
Random Sample Consensus (RANSAC) [10], Generalized
PCA (GPCA) [24], Local Subspace Analysis (LSA) 0,
Agglomerative Lossy Compression (ALC) 0, Sparse Subspace
Clustering (SSC) E2, Spectral Clustering (SC) ED, Spectral
Curvature Clustering (SCC) [27], Multi Stage Learning (MSL)
62 , Locally Linear Manifold Clustering (LLMC) El, Local
Best-fit Flats (LBF) E3 and Spectral LBF (SLBF) E3.

5) Evaluation Metrics: Segmentation accuracy (error) is
used to measure the performance of segmentation. The area
under the receiver operating characteristic (ROC) curve, known
as AUC, is used for evaluating the quality of outlier
detection. For more details about these two evaluation metrics,
please refer to the Appendix.

C. Results on Hopkins 155 

1) Choosing the Parameter $\lambda$: The parameter $\lambda > 0$ is
used to balance the effects of the two parts in problem (9).
In general, the choice of this parameter depends on the prior
knowledge of the error level of data. When the errors are slight,
we should use a relatively large $\lambda$; when the errors are heavy,
we should set $\lambda$ to be relatively small.
Fig. 8

The influences of the parameter $\lambda$ of LRR. (a) On all 156
sequences of Hopkins 155, the overall segmentation
performance is equally good while $3 < \lambda < 5$. (b) On the 43-th
sequence, the segmentation error is always 0 for
$0.001 < \lambda < 1000$. (c) On the 62-th sequence, the segmentation
performance is good only when $0.8 < \lambda < 1.6$.
TABLE II

Segmentation results (on Hopkins 155) of PCA, RPCA_1, RPCA_2,1,
SR and LRR.

segmentation errors (%) over all 156 sequences
        PCA     RPCA_1   RPCA_2,1   SR      LRR
mean    4.56    4.13     3.26       3.89    1.71
std.    10.80   10.37    9.09       7.70    4.85
max     49.78   45.83    47.15      32.57   33.33

average run time (seconds) per sequence
        0.2     0.8      0.8        4.2     1.9



Fig. 8(a) shows the evaluation results over all 156 sequences
in Hopkins 155: while $\lambda$ ranges from 1 to 6, the segmentation
error only varies from 1.69% to 2.81%; while $\lambda$ ranges from
3 to 5, the segmentation error remains almost unchanged,
slightly varying from 1.69% to 1.87%. This phenomenon is
mainly due to two reasons. First, on most sequences
(about 80%), which are almost clean and easy to segment, LRR
could work well by choosing $\lambda$ arbitrarily, as exemplified in
Fig. 8(b). Second, there is an "invariance" in LRR, namely
Theorem 4.3 implies that the minimizer to problem (9) always
satisfies $Z^* \in \mathrm{span}(X^T)$. This implies that the solution of
LRR can be partially stable while $\lambda$ is varying.

The analysis above does not deny the importance of model
selection. As shown in Fig. 8(c), the parameter $\lambda$ can largely
affect the segmentation performance on some sequences. Actually,
if we tune $\lambda$ to the best for each sequence, the overall
error rate is only 0.07%. Although this number is achieved in
an "impractical" way, it verifies the significance of selecting
the parameter $\lambda$, especially when the data is corrupted. For
the experiments below, we choose $\lambda = 4$ for LRR.

2) Segmentation Performance: In this subsection, we show
LRR's performance in subspace segmentation with the subspace
number given. For comparison, we also list the results of
PCA, RPCA$_1$, RPCA$_{2,1}$ and SR (these methods are introduced
in Section VI-B). Table II illustrates that LRR performs better
than PCA and RPCA. Here, the advantages of LRR are mainly
due to its methodology. More precisely, LRR directly targets
recovering the row space $V_0V_0^T$, which provably determines
the segmentation results. In contrast, PCA and RPCA methods
are designed for recovering the column space $U_0U_0^T$, which is
designed for dimension reduction. One may have noticed that
RPCA$_{2,1}$ outperforms PCA and RPCA$_1$. If we use instead



TABLE III

Results (on Hopkins 155) of estimating the subspace number.

# total   # predicted   prediction rate (%)   absolute error
156       121           77.6                  0.25

influences of the parameter τ
parameter τ       0.06   0.07   0.08   0.09   0.10   0.11
prediction rate   66.7   71.2   77.6   75.0   72.4   71.2
absolute error    0.37   0.30   0.25   0.26   0.29   0.30

TABLE IV

Segmentation errors (%) on Hopkins 155 (155 sequences).

        GPCA    RANSAC   MSL     LSA     LLMC
mean    10.34   9.76     5.06    4.94    4.80

        PCA     LBF      ALC     SCC     SLBF
mean    4.47    3.72     3.37    2.70    1.35

        SSC     SC       LRR 1 51 1   LRR 141   LRR (this paper)
mean    1.24    1.20     1.22         0.85      1.59



the $\ell_1$ norm to regularize $E$ in (9), the segmentation error is
2.03% ($\lambda = 0.6$, optimally determined). These results illustrate that
the errors in this database tend to be sample-specific.

Besides its superiority in segmentation accuracy, another
advantage of LRR is that it can work well under a wide range
of parameter settings, as shown in Fig. 8. In contrast, RPCA
methods are sensitive to the parameter $\lambda$. Taking RPCA$_{2,1}$
for example, it achieves an error rate of 3.26% by choosing
$\lambda = 0.32$. However, the error rate increases to 4.5% at
$\lambda = 0.34$, and 3.7% at $\lambda = 0.3$.

The efficiency (in terms of running time) of LRR is 
comparable to PCA and RPCA methods. Theoretically, the 
computational complexity (with regard to d and n) of LRR is 
the same as RPCA methods. LRR costs more computational 
time because its optimization procedure needs more iterations 
than RPCA to converge. 

3) Performance of Estimating Subspace Number: Since
there are 156 sequences in total, this database also provides a
good benchmark for evaluating the effectiveness of Algorithm
3, which is to estimate the number of subspaces underlying
a collection of data samples. Table III shows the results. By
choosing $\tau = 0.08$, LRR correctly predicts the true subspace
number of 121 sequences. The absolute error (i.e., $|\hat{k} - k|$)
averaged over all sequences is 0.25. These results illustrate that
it is promising to resolve the problem of estimating the subspace
number, which is a challenging model estimation problem.

4) Comparing to State-of-the-art Methods: Notice that previous
methods only report the results for 155 sequences. After
discarding the degenerate sequence, the error rate of LRR is
1.59%, which is comparable to the state-of-the-art methods, as
shown in Table IV. The performance of LRR can be further
improved by refining the formulation (9), which uses the
observed data matrix $X$ itself as the dictionary. When the data
is corrupted by dense noise (this is usually true in reality), this
certainly is not the best choice. In BTTl and [42], a non-convex
formulation is adopted to learn the original data $X_0$ and its
row space $V_0V_0^T$ simultaneously:



$$\min_{D,Z,E} \|Z\|_* + \lambda\|E\|_1 \quad \text{s.t.}\ X = D + E,\ D = DZ,$$
TABLE V

Segmentation accuracy (ACC) and AUC comparison on the
Yale-Caltech dataset.

              PCA      RPCA_1   RPCA_2,1   SR       LRR
ACC (%)       77.15    82.97    83.72      73.17    86.13
AUC           0.9653   0.9819   0.9863     0.9239   0.9927
time (sec.)   0.6      60.8     59.2       383.5    152.6



where the unknown variable $D$ is used as the dictionary.
This method can achieve an error rate of 1.22%. In fl4), it is
explained that the issues of choosing the dictionary can be relieved
by considering the unobserved, hidden data. Furthermore, it is
deduced that the effects of hidden data can be approximately
modeled by the following convex formulation:

$$\min_{Z,L,E} \|Z\|_* + \|L\|_* + \lambda\|E\|_1 \quad \text{s.t.}\ X = XZ + LX + E,$$

which intuitively integrates subspace segmentation and feature
extraction into a unified framework. This method can achieve
an error rate of 0.85%, which outperforms other subspace
segmentation algorithms.

While several methods have achieved an error rate below
3% on Hopkins 155, the subspace segmentation problem is still
far from solved. A long-term difficulty is how to solve the
model selection problems, e.g., estimating the parameter $\lambda$ of
LRR. Also, it would not be trivial to handle more complicated
datasets that contain more noise, outliers and corruptions.

D. Results on Yale-Caltech 

The goal of this test is to identify 609 non-face outliers
and segment the remaining 1204 face images into 38 clusters. The
performance of segmentation and outlier detection is evaluated
by segmentation accuracy (ACC) and AUC, respectively.
While investigating segmentation performance, the affinity 
matrix is computed from all images, including both the face 
images and non-face outliers. However, for the convenience 
of evaluation, the outliers and the corresponding affinities are 
removed (according to the ground truth) before using NCut to 
obtain the segmentation results. 

We resize all images into $20 \times 20$ pixels and form a data
matrix of size $400 \times 1813$. Table V shows the results of
PCA, RPCA, SR and LRR. It can be seen that LRR is better
than PCA and RPCA methods, in terms of both subspace
segmentation and outlier detection. These experimental results
are consistent with Theorem 5.2, which shows that LRR has
a stronger guarantee than RPCA methods in performance.
Notice that SR is behind the others (see footnote 6). This is because the
presence or absence of outliers does not necessarily alter
the sparsity of the reconstruction coefficients notably, and thus it is
hard for SR to handle well the data contaminated by outliers.

Fig. 9 shows the performance of LRR while the parameter $\lambda$
varies from 0.06 to 0.22. Notice that LRR is more sensitive to
$\lambda$ on this dataset than on Hopkins 155. This is because the error

6 The results (for outlier detection) in Table V are obtained by using the
strategy of (14). When using the strategy of checking the affinity degree, the
results produced by SR are even worse, only achieving an AUC of 0.81 by
using the best parameters.











Fig. 9

The influences of the parameter $\lambda$ of LRR (AUC and segmentation
accuracy plotted against the parameter $\lambda$, from 0.06 to 0.22). These
results are collected from the Yale-Caltech dataset. All images are
resized to $20 \times 20$ pixels.



Fig. 10

Some examples of using LRR to correct the errors in the
Yale-Caltech dataset, following the decomposition $X = XZ^* + E^*$.
Left: the original data matrix $X$;
Middle: the corrected data $XZ^*$; Right: the error $E^*$.



level of Hopkins 155 is quite low (see Table I), whereas the
Yale-Caltech dataset contains outliers and corrupted images
(see Fig. 7).

To visualize LRR's effectiveness in error correction, we
create another data matrix with size $8064 \times 1813$ by resizing
all images into $96 \times 84$. Fig. 10 shows some results produced by
LRR. It is worth noting that the "error" term $E^*$ can contain
"useful" information, e.g., the eyes and salient objects. Here,
the principle is to decompose the data matrix into a low-rank
part and a sparse part, with the low-rank part ($XZ^*$)
corresponding to the principal features of the whole dataset,
and the sparse part ($E^*$) corresponding to the rare features
which cannot be modeled by low-rank subspaces. This implies
that it is possible to use LRR to extract the discriminative
features and salient regions, as done in face recognition [4]
and saliency detection [19].
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VII. Conclusion and Future Work 

In this paper we proposed low-rank representation (LRR) to
identify the subspace structures from corrupted data. Namely,
our goal is to segment the samples into their respective
subspaces and correct the possible errors simultaneously. LRR
is a generalization of the recently established RPCA methods
[7], [16], extending the recovery of corrupted data from a single
subspace to multiple subspaces. Also, LRR generalizes the
approach of Shape Interaction Matrix (SIM), giving a way to
define an SIM between two different matrices (see Theorem
4.1), and providing a mechanism to recover the true SIM
(or row space) from corrupted data. Both theoretical and
experimental results show the effectiveness of LRR. However,
there still remain several problems for future work:

• Significant improvements may be achieved by learning a
dictionary $A$, which partially determines the solution of
LRR. In order to exactly recover the row space $V_0$,
Theorem 4.3 illustrates that the dictionary $A$ must satisfy
the condition $V_0 \in \mathrm{span}(A^T)$. When the data is only
contaminated by outliers, this condition can be obeyed
by simply choosing $A = X$. However, this choice cannot
ensure the validity of $V_0 \in \mathrm{span}(A^T)$ when the data
contains other types of errors, e.g., dense noise.

• The proofs of Theorem 5.2 are specific to the case
of $A = X$. As a future direction, it is interesting to
see whether the technique presented can be extended to
general dictionary matrices other than $X$.

• A critical issue in LRR is how to estimate or select the
parameter $\lambda$. For data contaminated by various errors
such as noise, outliers and corruptions, the estimation of
$\lambda$ is quite challenging.

• Subspace segmentation should not be the only application
of LRR. Actually, it has been successfully used
in applications other than segmentation, e.g., saliency
detection [19]. In general, the presented LRR method can
be extended to solve various applications well.

Appendix 

A. Terminologies 

In this subsection, we introduce some terminologies used in 
the paper. 

1) Block-Diagonal Matrix: In this paper, a matrix M is 
called block-diagonal if it has the form as in (1). For the matrix 
M which itself is not block-diagonal but can be transformed 
to be block-diagonal by simply permuting its rows and/or 
columns, we also say that M is block-diagonal. In summary, 
we say that a matrix M is block-diagonal whenever there
exist two permutation matrices $P_1$ and $P_2$ such that $P_1 M P_2$
is block-diagonal.

2) Union and Sum of Subspaces: For a collection of $k$ subspaces
$\{S_1, S_2, \cdots, S_k\}$, their union is defined by
$\cup_{i=1}^{k} S_i = \{y : y \in S_j, \text{ for some } 1 \le j \le k\}$, and their sum is
defined by $\sum_{i=1}^{k} S_i = \{y : y = \sum_{j=1}^{k} y_j, \; y_j \in S_j\}$. If any
$y \in \sum_{i=1}^{k} S_i$ can be uniquely expressed as $y = \sum_{j=1}^{k} y_j$,
$y_j \in S_j$, then the sum is also called the direct sum, denoted
as $\sum_{i=1}^{k} S_i = \oplus_{i=1}^{k} S_i$.



3) Independent Subspaces: A collection of $k$ subspaces
$\{S_1, S_2, \cdots, S_k\}$ are independent if and only if
$S_i \cap \sum_{j \neq i} S_j = \{0\}$ (or $\sum_{i=1}^{k} S_i = \oplus_{i=1}^{k} S_i$). When the subspaces
are of low rank and the ambient dimension is high, the
independence assumption is roughly equal to the pairwise disjoint
assumption; that is, $S_i \cap S_j = \{0\}, \forall i \neq j$.
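The gap between independence and pairwise disjointness can be checked numerically. The following Python sketch is an illustration added here, not part of the original analysis. It assumes each subspace is given by a basis matrix and uses the standard fact that subspaces are independent exactly when the dimension of their sum equals the sum of their dimensions. The helper name `are_independent` and the tolerance are our own choices.

```python
import numpy as np

def are_independent(bases, tol=1e-10):
    """bases: list of (d x r_i) arrays whose columns span each subspace.

    Independence holds iff dim(S_1 + ... + S_k) = dim(S_1) + ... + dim(S_k).
    """
    dims = [np.linalg.matrix_rank(B, tol=tol) for B in bases]
    dim_of_sum = np.linalg.matrix_rank(np.hstack(bases), tol=tol)
    return bool(dim_of_sum == sum(dims))

e = np.eye(3)
S1 = e[:, [0]]               # span{e1}
S2 = e[:, [1]]               # span{e2}
S3 = e[:, [0]] + e[:, [1]]   # span{e1 + e2}, contained in S1 + S2

two_lines = are_independent([S1, S2])          # independent
three_lines = are_independent([S1, S2, S3])    # pairwise disjoint, not independent
```

In this toy example the three lines intersect pairwise only at the origin, yet they are not independent because the third line lies in the sum of the first two.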

4) Full SVD and Skinny SVD: For an $m \times n$ matrix
$M$ (without loss of generality, assuming $m \le n$), its
Singular Value Decomposition (SVD) is defined by $M =
U[\Sigma, 0]V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma =
\mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_m)$ with $\{\sigma_i\}_{i=1}^{m}$ being singular values. The
SVD defined in this way is also called the full SVD. If
we only keep the positive singular values, the reduced form
is called the skinny SVD. For a matrix $M$ of rank $r$, its
skinny SVD is computed by $M = U_r \Sigma_r V_r^T$, where $\Sigma_r =
\mathrm{diag}(\sigma_1, \sigma_2, \cdots, \sigma_r)$ with $\{\sigma_i\}_{i=1}^{r}$ being positive singular
values. More precisely, $U_r$ and $V_r$ are formed by taking the
first $r$ columns of $U$ and $V$, respectively.

5) Pseudoinverse: For a matrix $M$ with skinny SVD
$U \Sigma V^T$, its pseudoinverse is uniquely defined by

$$M^\dagger = V \Sigma^{-1} U^T.$$
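As a minimal sketch of the two definitions above, the following Python/NumPy code builds a skinny SVD by truncating the thin SVD at the numerical rank and then forms the pseudoinverse from it. The function names, the absolute tolerance `tol` and the random test matrix are assumptions made for illustration only.

```python
import numpy as np

def skinny_svd(M, tol=1e-10):
    """Return (U_r, s_r, V_r) keeping only singular values larger than tol."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol))
    return U[:, :r], s[:r], Vt[:r, :].T

def pinv_from_skinny_svd(M, tol=1e-10):
    """Pseudoinverse V diag(1/s) U^T built from the skinny SVD."""
    U, s, V = skinny_svd(M, tol)
    return V @ np.diag(1.0 / s) @ U.T

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))  # rank-2 matrix
U, s, V = skinny_svd(M)
M_pinv = pinv_from_skinny_svd(M)
```

Under these assumptions, `M_pinv` should agree with `np.linalg.pinv(M, rcond=1e-10)` up to rounding, and `U @ U.T` and `V @ V.T` act as the orthogonal projections onto the column and row spaces of `M`.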

6) Column Space and Row Space: For a matrix $M$, its
column (resp. row) space is the linear space spanned by
its column (resp. row) vectors. Let the skinny SVD of $M$
be $U \Sigma V^T$, then $U$ (resp. $V$) are orthonormal bases of the
column (resp. row) space, and the corresponding orthogonal
projection is given by $UU^T$ (resp. $VV^T$). Since $UU^T$ (resp.
$VV^T$) is uniquely determined by the column (resp. row) space,
sometimes we also use $UU^T$ (resp. $VV^T$) to refer to the
column (resp. row) space.

7) Affinity Degree: Let $M$ be a symmetric affinity matrix
for a collection of $n$ data samples. The affinity degree of the
$i$-th sample is defined by $\#\{j : [M]_{ij} \neq 0\}$, i.e., the number
of samples connected to the $i$-th sample.
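The affinity degree is simply a row-wise count of nonzero entries. A small Python sketch follows; the function name, the optional tolerance `tol` (useful when entries are only numerically zero) and the toy matrix are illustrative assumptions, not taken from the paper.

```python
def affinity_degrees(M, tol=0.0):
    """For each row i, count the entries [M]_ij whose magnitude exceeds tol."""
    return [sum(1 for v in row if abs(v) > tol) for row in M]

# Toy symmetric affinity matrix: samples 0-2 are mutually connected,
# sample 3 is connected to nothing.
M = [
    [0.0, 0.8, 0.5, 0.0],
    [0.8, 0.0, 0.6, 0.0],
    [0.5, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
degrees = affinity_degrees(M)
```

Here the last sample has degree zero, which is the kind of isolated sample that a degree-based check would flag as a candidate outlier.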

B. Proofs 

1) Proof of Theorem 4.1: The proof of Theorem 4.1 is 
based on the following three lemmas. 

Lemma 7.1: Let $U$, $V$ and $M$ be matrices of compatible
dimensions. Suppose both $U$ and $V$ have orthogonal columns,
i.e., $U^T U = I$ and $V^T V = I$, then we have

$$\|M\|_* = \|UMV^T\|_*.$$

Proof: Let the full SVD of $M$ be $M = U_M \Sigma_M V_M^T$, then
$UMV^T = (UU_M)\Sigma_M(VV_M)^T$. As $(UU_M)^T(UU_M) = I$
and $(VV_M)^T(VV_M) = I$, $(UU_M)\Sigma_M(VV_M)^T$ is actually an
SVD of $UMV^T$. By the definition of the nuclear norm, we
have $\|M\|_* = \mathrm{tr}(\Sigma_M) = \|UMV^T\|_*$. ■
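Lemma 7.1 is easy to sanity-check numerically. The Python/NumPy sketch below is an illustration under our own assumptions (random matrices, helper names `nuclear_norm` and `random_orthonormal_columns`): it draws $U$ and $V$ with orthonormal columns via a QR factorization and compares the two nuclear norms.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def random_orthonormal_columns(rows, cols, rng):
    """Return a rows x cols matrix Q with Q^T Q = I (reduced QR of a Gaussian)."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

rng = np.random.default_rng(0)
U = random_orthonormal_columns(7, 3, rng)   # U^T U = I_3
V = random_orthonormal_columns(6, 4, rng)   # V^T V = I_4
M = rng.standard_normal((3, 4))

lhs = nuclear_norm(M)
rhs = nuclear_norm(U @ M @ V.T)
```

The two values should coincide up to floating-point rounding, as the lemma predicts.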
Lemma 7.2: For any four matrices $B$, $C$, $D$ and $F$ of
compatible dimensions, we have

$$\left\| \begin{bmatrix} B & C \\ D & F \end{bmatrix} \right\|_* \ge \|B\|_*,$$

where the equality holds if and only if $C = 0$, $D = 0$ and
$F = 0$.

Proof: The proof is simply based on the following fact:
for any two matrices $M_1$ and $M_2$, we have

$$\|[M_1, M_2]\|_* \ge \|M_1\|_* \quad \text{and/or} \quad \|[M_1; M_2]\|_* \ge \|M_1\|_*,$$

and the equality can hold if and only if $M_2 = 0$. ■
Lemma 7.3: Let $U$, $V$ and $M$ be given matrices of compatible
dimensions. Suppose both $U$ and $V$ have orthogonal
columns, i.e., $U^T U = I$ and $V^T V = I$, then the following
optimization problem

$$\min_Z \|Z\|_*, \quad \text{s.t. } U^T Z V = M, \tag{15}$$

has a unique minimizer $Z^* = UMV^T$.

Proof: First, we prove that $\|M\|_*$ is the minimum
objective function value and $Z^* = UMV^T$ is a minimizer.
For any feasible solution $Z$, let $Z = U_Z \Sigma_Z V_Z^T$ be its full
SVD. Let $B = U^T U_Z$ and $C = V_Z^T V$. Then the constraint
$U^T Z V = M$ is equal to

$$B \Sigma_Z C = M. \tag{16}$$

Since $BB^T = I$ and $C^T C = I$, we can find the orthogonal
complements $B_\perp$ and $C_\perp$ such that

$$\begin{bmatrix} B \\ B_\perp \end{bmatrix} \quad \text{and} \quad [C, C_\perp]$$

are orthogonal matrices. According to the unitary invariance
of the nuclear norm, Lemma 7.2 and (16), we have

$$\|Z\|_* = \|\Sigma_Z\|_* = \left\| \begin{bmatrix} B \\ B_\perp \end{bmatrix} \Sigma_Z [C, C_\perp] \right\|_* = \left\| \begin{bmatrix} B \Sigma_Z C & B \Sigma_Z C_\perp \\ B_\perp \Sigma_Z C & B_\perp \Sigma_Z C_\perp \end{bmatrix} \right\|_* \ge \|B \Sigma_Z C\|_* = \|M\|_*.$$

Hence, $\|M\|_*$ is the minimum objective function value of
problem (15). At the same time, Lemma 7.1 proves that
$\|Z^*\|_* = \|UMV^T\|_* = \|M\|_*$. So $Z^* = UMV^T$ is a
minimizer to problem (15).

Second, we prove that $Z^* = UMV^T$ is the unique minimizer.
Assume that $Z_1 = UMV^T + H$ is another optimal
solution. By $U^T Z_1 V = M$, we have

$$U^T H V = 0. \tag{17}$$

Since $U^T U = I$ and $V^T V = I$, similar to above, we can
construct two orthogonal matrices: $[U, U_\perp]$ and $[V, V_\perp]$. By
the optimality of $Z_1$, we have

$$\|M\|_* = \|Z_1\|_* = \|UMV^T + H\|_* = \left\| [U, U_\perp]^T (UMV^T + H) [V, V_\perp] \right\|_* = \left\| \begin{bmatrix} M & U^T H V_\perp \\ U_\perp^T H V & U_\perp^T H V_\perp \end{bmatrix} \right\|_* \ge \|M\|_*.$$

According to Lemma 7.2, the above equality can hold if and
only if

$$U^T H V_\perp = U_\perp^T H V = U_\perp^T H V_\perp = 0.$$

7 When $B$ and/or $C$ are already orthogonal matrices, i.e., $B_\perp = \emptyset$ and/or
$C_\perp = \emptyset$, our proof is still valid.



Together with (17), all four blocks of $[U, U_\perp]^T H [V, V_\perp]$
vanish, and we conclude that $H = 0$. So the optimal
solution is unique. ■

It is worth noting that Lemma 7.3 allows us to get closed-form
solutions to a class of nuclear norm minimization problems,
and leads to a simple proof of Theorem 4.1.

Proof: (of Theorem 4.1) Since $X \in \mathrm{span}(A)$, we have
$\mathrm{rank}([X, A]) = \mathrm{rank}(A)$. Let's define $V_X$ and $V_A$ as follows:
Compute the skinny SVD of the horizontal concatenation of
$X$ and $A$, denoted as $[X, A] = U \Sigma V^T$, and partition $V$ as
$V = [V_X; V_A]$ such that $X = U \Sigma V_X^T$ and $A = U \Sigma V_A^T$
(note that $V_A$ and $V_X$ may be not column-orthogonal). By this
definition, it can be concluded that the matrix $V_A^T$ has full row
rank. That is, if the skinny SVD of $V_A^T$ is $U_1 \Sigma_1 V_1^T$, then $U_1$
is an orthogonal matrix. Through some simple computations,
we have

$$V_A (V_A^T V_A)^{-1} = V_1 \Sigma_1^{-1} U_1^T. \tag{18}$$

Also, it can be calculated that the constraint $X = AZ$ is equal
to $V_X^T = V_A^T Z$, which is also equal to $\Sigma_1^{-1} U_1^T V_X^T = V_1^T Z$. So
problem (5) is equal to the following optimization problem:

$$\min_Z \|Z\|_*, \quad \text{s.t. } V_1^T Z = \Sigma_1^{-1} U_1^T V_X^T.$$

By Lemma 7.3 and (18), problem (5) has a unique minimizer

$$Z^* = V_1 \Sigma_1^{-1} U_1^T V_X^T = V_A (V_A^T V_A)^{-1} V_X^T.$$

Next, it will be shown that the above closed-form solution
can be further simplified. Notice that $V_A^T = \Sigma^{-1} U^T A$ and
$V_X^T = \Sigma^{-1} U^T X$. Then we have

$$\begin{aligned}
Z^* &= A^T U \Sigma^{-1} (\Sigma^{-1} U^T A A^T U \Sigma^{-1})^{-1} \Sigma^{-1} U^T X \\
&= A^T U (U^T A A^T U)^{-1} U^T X \\
&= (U^T A)^\dagger U^T X \\
&= A^\dagger X,
\end{aligned}$$

where the last equality is due to that $(U^T A)^\dagger U^T =
(\Sigma V_A^T)^\dagger U^T = (U \Sigma V_A^T)^\dagger = A^\dagger$. ■

2) Proof of Corollary 4.1: Proof: By $X \in \mathrm{span}(A)$,
we have $\mathrm{rank}(A^\dagger X) = \mathrm{rank}(X)$. Hence, $\mathrm{rank}(Z^*) =
\mathrm{rank}(X)$. At the same time, for any feasible solution $Z$ to
problem (5), we have $\mathrm{rank}(Z) \ge \mathrm{rank}(AZ) = \mathrm{rank}(X)$.
So, $Z^*$ is also optimal to problem (4). ■
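The closed form $Z^* = A^\dagger X$ and the rank statement of Corollary 4.1 can be illustrated numerically. The following Python/NumPy sketch is an added illustration under our own assumptions (random rank-3 dictionary, helper name `nuclear_norm`, tolerance values). It compares $Z^*$ against other feasible points, which all have the form $Z^* + H$ with $AH = 0$.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(1)
d, m, n = 5, 8, 6
A = rng.standard_normal((d, 3)) @ rng.standard_normal((3, m))  # rank-3 dictionary
X = A @ rng.standard_normal((m, n))                            # X lies in span(A)

A_pinv = np.linalg.pinv(A, rcond=1e-10)
Z_star = A_pinv @ X                     # closed-form minimizer of Theorem 4.1

# Every feasible Z equals Z_star + H with A @ H = 0; sample a few such H.
P_null = np.eye(m) - A_pinv @ A         # orthogonal projector onto null(A)
other_norms = []
for _ in range(5):
    H = P_null @ rng.standard_normal((m, n))
    other_norms.append(nuclear_norm(Z_star + H))

rank_Z = int(np.linalg.matrix_rank(Z_star, tol=1e-8))
rank_X = int(np.linalg.matrix_rank(X, tol=1e-8))
```

In this experiment `A @ Z_star` reproduces `X`, each sampled alternative has a strictly larger nuclear norm, and `rank_Z` equals `rank_X`. This is a spot check on random instances, not a substitute for the proof.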

3) Proof of Theorem 4.2: The proof of Theorem 4.2 is 
based on the following well-known lemma. 

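Lemma 7.4: For any four matrices $B$, $C$, $D$ and $F$ of
compatible dimensions, we have

$$\left\| \begin{bmatrix} B & C \\ D & F \end{bmatrix} \right\|_* \ge \left\| \begin{bmatrix} B & 0 \\ 0 & F \end{bmatrix} \right\|_* = \|B\|_* + \|F\|_*.$$

The above lemma allows us to lower-bound the objective
value at any solution $Z$ by the value of the block-diagonal
restriction of $Z$, and thus leads to a simple proof of Theorem
4.2.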
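A quick numerical spot check of Lemma 7.4 (and, as a by-product, of Lemma 7.2) is sketched below in Python/NumPy. The random blocks and the helper name `nuclear_norm` are illustrative assumptions.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of the singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 2))
D = rng.standard_normal((2, 3))
F = rng.standard_normal((2, 2))

full = np.block([[B, C], [D, F]])
block_diag = np.block([[B, np.zeros_like(C)], [np.zeros_like(D), F]])

norm_full = nuclear_norm(full)
norm_block_diag = nuclear_norm(block_diag)
norm_sum = nuclear_norm(B) + nuclear_norm(F)
```

On this instance `norm_full` is at least `norm_block_diag`, which in turn matches `norm_sum` up to rounding.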

Proof: Let $Z^*$ be the optimizer to problem (5). Form a
block-diagonal matrix $W$ by setting

$$[W]_{ij} = \begin{cases} [Z^*]_{ij}, & \text{if } [A]_{:,i} \text{ and } [X]_{:,j} \text{ belong to the same subspace}, \\ 0, & \text{otherwise}. \end{cases}$$

Write $Q = Z^* - W$. For any data vector $[X]_{:,j}$, without loss
of generality, suppose $[X]_{:,j}$ belongs to the $i$-th subspace; i.e.,
$[AZ^*]_{:,j} \in S_i$. Then by construction, we have $[AW]_{:,j} \in S_i$
and $[AQ]_{:,j} \in \oplus_{m \neq i} S_m$. But $[AQ]_{:,j} = [X]_{:,j} - [AW]_{:,j} \in
S_i$. By independence, we have $S_i \cap \oplus_{m \neq i} S_m = \{0\}$, and so
$[AQ]_{:,j} = 0, \forall j$.

Hence, $AQ = 0$, and $W$ is feasible for (5). By Lemma
7.4 we have $\|Z^*\|_* \ge \|W\|_*$. Also, by the uniqueness of the
minimizer (see Theorem 4.1), we conclude that $Z^* = W$ and
hence $Z^*$ is block-diagonal.

Again, by the uniqueness of the minimizer $Z^*$, we can
conclude that for all $i$'s, $Z_i^*$ is also the unique minimizer to
the following optimization problem:

$$\min_J \|J\|_*, \quad \text{s.t. } X_i = A_i J.$$

By Corollary 4.1, we conclude that $\mathrm{rank}(Z_i^*) = \mathrm{rank}(X_i)$.

■
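The block-diagonal structure can be observed on synthetic data. The sketch below, in Python/NumPy, is an illustration under assumptions of our own: three random 2-dimensional subspaces in a 10-dimensional ambient space (independent with probability one), four clean samples per subspace, and the dictionary chosen as $A = X$ so that the minimizer is $X^\dagger X$ by Theorem 4.1.

```python
import numpy as np

rng = np.random.default_rng(2)
d, dim, per, k = 10, 2, 4, 3   # ambient dim, subspace dim, samples per subspace, subspaces

# Columns of each block are clean samples from one random subspace.
blocks = [rng.standard_normal((d, dim)) @ rng.standard_normal((dim, per))
          for _ in range(k)]
X = np.hstack(blocks)

Z_star = np.linalg.pinv(X, rcond=1e-10) @ X      # minimizer with A = X

labels = np.repeat(np.arange(k), per)            # subspace index of each column
same = labels[:, None] == labels[None, :]        # True inside diagonal blocks
off_block_max = float(np.abs(Z_star[~same]).max())

block_ranks = [
    int(np.linalg.matrix_rank(Z_star[np.ix_(labels == i, labels == i)], tol=1e-8))
    for i in range(k)
]
```

In this setting the entries of `Z_star` linking different subspaces are zero up to rounding, and each diagonal block has rank 2, the rank of the corresponding data block.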

4) Proof of Theorem 4.3: Proof: Note that the LRR
problem (7) always has feasible solution(s), e.g., $(Z = 0, E =
X)$ is feasible. So, an optimal solution, denoted as $(Z^*, E^*)$,
exists. By Theorem 4.1, we have

$$\begin{aligned}
Z^* &= \arg\min_Z \|Z\|_* \quad \text{s.t. } X - E^* = AZ \\
&= A^\dagger (X - E^*),
\end{aligned}$$

which simply leads to $Z^* \in \mathrm{span}(A^T)$. ■

5) Proof of Theorem 5.3: Proof: Let the skinny SVD
of $X$ be $U \Sigma V^T$. It is simple to see that $(VV^T, 0)$ is feasible
to problem (9). By the convexity of (9), we have

$$\|Z^*\|_* \le \|Z^*\|_* + \lambda \|E^*\|_{2,1} \le \|VV^T\|_* = \mathrm{rank}(X) \le \min(d, n).$$

Hence,

$$\|Z^* - V_0 V_0^T\|_F \le \|Z^* - V_0 V_0^T\|_* \le \|Z^*\|_* + \|V_0 V_0^T\|_* = \|Z^*\|_* + r_0 \le \min(d, n) + r_0.$$



C. Evaluation Metrics 

1) Segmentation Accuracy (or Error): The segmentation results
can be evaluated in a similar way as classification results.
Nevertheless, since segmentation methods cannot provide the
class label for each cluster, a postprocessing step is needed to
assign each cluster a label. A commonly used strategy is to
try every possible label vector that satisfies the segmentation
results. The final label vector is chosen as the one that best
matches the ground truth classification results. Such a global
search strategy is precise, but inefficient when the subspace
number $k$ is large. Namely, the computational complexity is
$k!$, which is higher than $2^k$ for $k > 3$. Hence, we suggest
a local search strategy as follows: given the ground truth
classification results, the label of a cluster is the index of
the ground truth class that contributes the maximum number
of samples to the cluster. This local search strategy is quite
efficient because its computational complexity is only $O(k)$,
and can usually produce the same evaluation results as global
search. Nevertheless, it is possible that two different clusters
are assigned the same label. So, we use the local search
strategy only when $k > 10$.
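The two labeling strategies can be sketched in a few lines of Python. This is an illustrative implementation written for this description, not the authors' evaluation code; the function names and the toy label vectors are assumptions, and cluster and class indices are taken to be integers in `0..k-1`.

```python
from collections import Counter
from itertools import permutations

def accuracy_global(pred, truth, k):
    """Global search: try every one-to-one relabeling of clusters (k! candidates)."""
    best_hits = 0
    for perm in permutations(range(k)):
        hits = sum(1 for p, t in zip(pred, truth) if perm[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(truth)

def accuracy_local(pred, truth):
    """Local search: label each cluster by the class contributing most samples."""
    hits = 0
    for cluster in set(pred):
        members = [t for p, t in zip(pred, truth) if p == cluster]
        hits += Counter(members).most_common(1)[0][1]
    return hits / len(truth)

# Case 1: both strategies agree.
truth_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_a = [1, 1, 1, 2, 2, 0, 0, 0, 0]

# Case 2: local search gives two clusters the same label and is more optimistic.
truth_b = [0, 0, 0, 0, 1, 1]
pred_b = [0, 0, 1, 1, 2, 2]
```

In the first case both functions return 8/9. In the second case the local strategy labels clusters 0 and 1 with the same class and reports 1.0, while the global strategy reports 4/6, which illustrates the caveat mentioned above.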

2) Receiver Operating Characteristic: To evaluate the
effectiveness of outlier detection without choosing a parameter
$\delta$ for (14), we consider the receiver operating characteristic
(ROC), which is widely used to evaluate the performance of
binary classifiers. The ROC curve is obtained by trying all
possible thresholding values, and for each value, plotting the
true positive rate on the Y-axis against the false positive rate
on the X-axis. The area under the ROC curve, known as
AUC, provides a number for evaluating the quality of outlier
detection. Note that a larger AUC score is better, and the score
always ranges between 0 and 1.
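The ROC construction described above can be sketched as follows in plain Python. The scores stand for any per-sample outlier score (for instance a column norm of the error term, which is an assumption here), and the function names and toy data are illustrative.

```python
def roc_curve_points(scores, is_outlier):
    """Sweep the threshold over all distinct scores; return (FPR, TPR) points."""
    pos = sum(is_outlier)
    neg = len(is_outlier) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_outlier) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, is_outlier) if s >= thr and not y)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: higher score means "more likely an outlier".
scores = [0.9, 0.8, 0.3, 0.75, 0.2, 0.1]
is_outlier = [1, 1, 0, 0, 1, 0]
auc_value = auc(roc_curve_points(scores, is_outlier))
```

For this toy data 7 of the 9 (outlier, inlier) pairs are ranked correctly, so `auc_value` equals 7/9; a perfect ranking would give 1.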